[PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
Baolin Wang
baolin.wang at linux.alibaba.com
Mon Mar 9 18:37:55 PDT 2026
On 3/7/26 4:02 PM, Barry Song wrote:
> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
> <baolin.wang at linux.alibaba.com> wrote:
>>
>>
>>
>> On 3/7/26 5:07 AM, Barry Song wrote:
>>> On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
>>> <baolin.wang at linux.alibaba.com> wrote:
>>>>
>>>> Currently, folio_referenced_one() always checks the young flag for each PTE
>>>> sequentially, which is inefficient for large folios. This inefficiency is
>>>> especially noticeable when reclaiming clean file-backed large folios, where
>>>> folio_referenced() is observed as a significant performance hotspot.
>>>>
>>>> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
>>>> an optimization to clear the young flags for PTEs within a contiguous range.
>>>> However, this is not sufficient. We can extend this to perform batched operations
>>>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>>>
>>>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>>>> of the young flags and flushing TLB entries, thereby improving performance
>>>> during large folio reclamation. And it will be overridden by the architecture
>>>> that implements a more efficient batch operation in the following patches.
>>>>
>>>> While we are at it, rename ptep_clear_flush_young_notify() to
>>>> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>>>>
>>>> Reviewed-by: Harry Yoo <harry.yoo at oracle.com>
>>>> Reviewed-by: Ryan Roberts <ryan.roberts at arm.com>
>>>> Signed-off-by: Baolin Wang <baolin.wang at linux.alibaba.com>
>>>
>>> LGTM,
>>>
>>> Reviewed-by: Barry Song <baohua at kernel.org>
>>
>> Thanks.
>>
>>>> ---
>>>> include/linux/mmu_notifier.h | 9 +++++----
>>>> include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
>>>> mm/rmap.c | 28 +++++++++++++++++++++++++---
>>>> 3 files changed, 65 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>>>> index d1094c2d5fb6..07a2bbaf86e9 100644
>>>> --- a/include/linux/mmu_notifier.h
>>>> +++ b/include/linux/mmu_notifier.h
>>>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>>> range->owner = owner;
>>>> }
>>>>
>>>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>>>> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
>>>> ({ \
>>>> int __young; \
>>>> struct vm_area_struct *___vma = __vma; \
>>>> unsigned long ___address = __address; \
>>>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>>>> + unsigned int ___nr = __nr; \
>>>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>>>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>>>> ___address, \
>>>> ___address + \
>>>> - PAGE_SIZE); \
>>>> + ___nr * PAGE_SIZE); \
>>>> __young; \
>>>> })
>>>>
>>>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>>>
>>>> #define mmu_notifier_range_update_to_read_only(r) false
>>>>
>>>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>>>> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
>>>> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>>>> #define ptep_clear_young_notify ptep_test_and_clear_young
>>>> #define pmdp_clear_young_notify pmdp_test_and_clear_young
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 21b67d937555..a50df42a893f 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> }
>>>> #endif
>>>>
>>>> +#ifndef clear_flush_young_ptes
>>>> +/**
>>>> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
>>>> + * folio as old and flush the TLB.
>>>> + * @vma: The virtual memory area the pages are mapped into.
>>>> + * @addr: Address the first page is mapped at.
>>>> + * @ptep: Page table pointer for the first entry.
>>>> + * @nr: Number of entries to clear access bit.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>> + * loop over ptep_clear_flush_young().
>>>> + *
>>>> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>>>> + * some PTEs might be write-protected.
>>>> + *
>>>> + * Context: The caller holds the page table lock. The PTEs map consecutive
>>>> + * pages that belong to the same folio. The PTEs are all in the same PMD.
>>>> + */
>>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>>>> +{
>>>> + int young = 0;
>>>> +
>>>> + for (;;) {
>>>> + young |= ptep_clear_flush_young(vma, addr, ptep);
>>>> + if (--nr == 0)
>>>> + break;
>>>> + ptep++;
>>>> + addr += PAGE_SIZE;
>>>> + }
>>>> +
>>>> + return young;
>>>> +}
>>>> +#endif
>>>
>>> We might have an opportunity to batch the TLB synchronization,
>>> using flush_tlb_range() instead of calling flush_tlb_page()
>>> one by one. Not sure the benefit would be significant though,
>>> especially if only one entry among nr has the young bit set.
>>
>> Yes. In addition, this will involve many architectures’ implementations
>> and their differing TLB flush mechanisms, so it’s difficult to make a
>> reasonable per-architecture measurement. If any architecture has a more
>> efficient flush method, I’d prefer to implement an architecture‑specific
>> clear_flush_young_ptes().
>
> Right! Since TLBI is usually quite expensive, I wonder if a generic
> implementation for architectures lacking clear_flush_young_ptes()
> might benefit from something like the below (just a very rough idea):
>
> int clear_flush_young_ptes(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> {
> unsigned long curr_addr = addr;
> int young = 0;
>
> while (nr--) {
> young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
> ptep++;
> curr_addr += PAGE_SIZE;
> }
>
> if (young)
> flush_tlb_range(vma, addr, curr_addr);
> return young;
> }

I understand your point. My concern is that I can’t test this patch on
every architecture to validate the benefit. Anyway, let me try this on
my x86 machine first.
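
To make the comparison concrete, below is a rough userspace model of the
two strategies. It is not kernel code: a "PTE" is reduced to its young bit,
TLB flushes are only counted, and every name in it (struct flush_stats and
the model_* helpers) is hypothetical. It also assumes the per-PTE helper
flushes only when the entry was young, which may not hold for every
architecture's ptep_clear_flush_young().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only, not kernel code. Every name here is hypothetical.
 * A "PTE" is reduced to its young bit, and TLB flushes are only counted.
 */
struct flush_stats {
	unsigned int page_flushes;	/* stands in for flush_tlb_page() */
	unsigned int range_flushes;	/* stands in for flush_tlb_range() */
};

/* Stands in for ptep_test_and_clear_young(): return the old bit, clear it. */
bool model_test_and_clear_young(bool *pte)
{
	bool young = *pte;

	*pte = false;
	return young;
}

/*
 * Per-entry variant, like the loop in this patch. Assumes the per-PTE
 * helper flushes only when the entry was young.
 */
int model_clear_young_each(bool *ptes, unsigned int nr,
			   struct flush_stats *st)
{
	int young = 0;

	assert(nr > 0);
	for (unsigned int i = 0; i < nr; i++) {
		if (model_test_and_clear_young(&ptes[i])) {
			st->page_flushes++;
			young = 1;
		}
	}
	return young;
}

/* Batched variant, like Barry's idea: clear everything, then flush once. */
int model_clear_young_batched(bool *ptes, unsigned int nr,
			      struct flush_stats *st)
{
	int young = 0;

	assert(nr > 0);
	for (unsigned int i = 0; i < nr; i++)
		young |= model_test_and_clear_young(&ptes[i]);

	/* At most one flush covering the whole range. */
	if (young)
		st->range_flushes++;
	return young;
}
```

With 8 entries of which 3 are young, the per-entry variant counts 3 page
flushes and the batched variant counts 1 range flush. Whether one range
flush is really cheaper than a few page flushes depends on the
architecture, which is the part that needs measuring.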