[PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios

Baolin Wang baolin.wang at linux.alibaba.com
Tue Mar 17 18:37:12 PDT 2026



On 3/17/26 3:30 PM, Barry Song wrote:
> On Mon, Mar 16, 2026 at 2:25 PM Baolin Wang
> <baolin.wang at linux.alibaba.com> wrote:
>>
>>
>>
>> On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
>>> On 3/10/26 02:37, Baolin Wang wrote:
>>>>
>>>>
>>>> On 3/7/26 4:02 PM, Barry Song wrote:
>>>>> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
>>>>> <baolin.wang at linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> Yes. In addition, this will involve many architectures’ implementations
>>>>>> and their differing TLB flush mechanisms, so it’s difficult to make a
>>>>>> reasonable per-architecture measurement. If any architecture has a more
>>>>>> efficient flush method, I’d prefer to implement an architecture‑specific
>>>>>> clear_flush_young_ptes().
>>>>>
>>>>> Right! Since TLBI is usually quite expensive, I wonder if a generic
>>>>> implementation for architectures lacking clear_flush_young_ptes()
>>>>> might benefit from something like the below (just a very rough idea):
>>>>>
>>>>> int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>>>                    unsigned long addr, pte_t *ptep, unsigned int nr)
>>>>> {
>>>>>            unsigned long curr_addr = addr;
>>>>>            int young = 0;
>>>>>
>>>>>            while (nr--) {
>>>>>                    young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
>>>>>                    ptep++;
>>>>>                    curr_addr += PAGE_SIZE;
>>>>>            }
>>>>>
>>>>>            if (young)
>>>>>                    flush_tlb_range(vma, addr, curr_addr);
>>>>>            return young;
>>>>> }
>>>>
>>>> I understand your point. I’m concerned that I can’t test this patch on
>>>> every architecture to validate the benefits. Anyway, let me try this on
>>>> my X86 machine first.
>>>
>>> In any case, please make that a follow-up patch :)
>>
>> Sure. However, after investigating RISC‑V and x86, I found that
>> ptep_clear_flush_young() does not flush the TLB on these architectures:
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>                             unsigned long address, pte_t *ptep)
>> {
>>          /*
>>           * On x86 CPUs, clearing the accessed bit without a TLB flush
>>           * doesn't cause data corruption. [ It could cause incorrect
>>           * page aging and the (mistaken) reclaim of hot pages, but the
>>           * chance of that should be relatively low. ]
>>           *
>>           * So as a performance optimization don't flush the TLB when
>>           * clearing the accessed bit, it will eventually be flushed by
>>           * a context switch or a VM operation anyway. [ In the rare
>>           * event of it not getting flushed for a long time the delay
>>           * shouldn't really matter because there's no real memory
>>           * pressure for swapout to react to. ]
>>           */
>>          return ptep_test_and_clear_young(vma, address, ptep);
>> }
>>
>> I don't have access to other architectures, so I think we can postpone
>> this optimization unless someone is interested in optimizing the TLB flush.
> 
> The comment is interesting. I think it likely applies to most
> architectures, including ARM64. The main reason ARM64 doesn’t use
> this approach is probably that it can issue tlbi_nosync and then
> rely on a final dsb to ensure all invalidations are completed—
> and tlbi_nosync itself is relatively cheap.

Actually, we both tried this a few years ago, but neither of us succeeded :).

My patch: https://lkml.org/lkml/2023/10/24/533

Your patch: 
https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/

Now I’m more inclined toward your approach, to align with MGLRU. Maybe 
it’s time to restart the discussion on that patch? :)
