[PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios

Tue Mar 17 00:30:16 PDT 2026

On Mon, Mar 16, 2026 at 2:25 PM Baolin Wang
<baolin.wang at linux.alibaba.com> wrote:
>
>
>
> On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
> > On 3/10/26 02:37, Baolin Wang wrote:
> >>
> >>
> >> On 3/7/26 4:02 PM, Barry Song wrote:
> >>> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
> >>> <baolin.wang at linux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>> Yes. In addition, this will involve many architectures’ implementations
> >>>> and their differing TLB flush mechanisms, so it’s difficult to make a
> >>>> reasonable per-architecture measurement. If any architecture has a more
> >>>> efficient flush method, I’d prefer to implement an architecture‑specific
> >>>> clear_flush_young_ptes().
> >>>
> >>> Right! Since TLBI is usually quite expensive, I wonder if a generic
> >>> implementation for architectures lacking clear_flush_young_ptes()
> >>> might benefit from something like the below (just a very rough idea):
> >>>
> >>> int clear_flush_young_ptes(struct vm_area_struct *vma,
> >>>                   unsigned long addr, pte_t *ptep, unsigned int nr)
> >>> {
> >>>           unsigned long curr_addr = addr;
> >>>           int young = 0;
> >>>
> >>>           while (nr--) {
> >>>                   young |= ptep_test_and_clear_young(vma, curr_addr,
> >>>                                                      ptep);
> >>>                   ptep++;
> >>>                   curr_addr += PAGE_SIZE;
> >>>           }
> >>>
> >>>           if (young)
> >>>                   flush_tlb_range(vma, addr, curr_addr);
> >>>           return young;
> >>> }
> >>
> >> I understand your point. I’m concerned that I can’t test this patch on
> >> every architecture to validate the benefits. Anyway, let me try this on
> >> my X86 machine first.
> >
> > In any case, please make that a follow-up patch :)
>
> Sure. However, after investigating RISC‑V and x86, I found that
> ptep_clear_flush_young() does not flush the TLB on these architectures:
>
> int ptep_clear_flush_young(struct vm_area_struct *vma,
>                            unsigned long address, pte_t *ptep)
> {
>         /*
>          * On x86 CPUs, clearing the accessed bit without a TLB flush
>          * doesn't cause data corruption. [ It could cause incorrect
>          * page aging and the (mistaken) reclaim of hot pages, but the
>          * chance of that should be relatively low. ]
>          *
>          * So as a performance optimization don't flush the TLB when
>          * clearing the accessed bit, it will eventually be flushed by
>          * a context switch or a VM operation anyway. [ In the rare
>          * event of it not getting flushed for a long time the delay
>          * shouldn't really matter because there's no real memory
>          * pressure for swapout to react to. ]
>          */
>         return ptep_test_and_clear_young(vma, address, ptep);
> }
>
> I don't have access to other architectures, so I think we can postpone
> this optimization unless someone is interested in optimizing the TLB flush.

The comment is interesting. I think it likely applies to most
architectures, including ARM64. The main reason ARM64 doesn’t take
this approach is probably that it can issue tlbi_nosync and then
rely on a final dsb to ensure all invalidations have completed,
and tlbi_nosync itself is relatively cheap.
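
To make the trade-off a bit more concrete, here is a very rough
user-space model (not kernel code, and not what any architecture
actually implements). It only counts how many invalidations and
synchronising barriers each strategy would issue for a batch of
PTEs. All model_* names and the accessed-bit position are made up
for illustration, and the "queue without waiting, then sync once"
variant simply follows my description above rather than the real
arm64 code path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessed ("young") bit position, for this model only. */
#define MODEL_PTE_YOUNG (1ULL << 0)

struct model_stats {
	unsigned int invalidations;	/* per-page invalidate requests */
	unsigned int syncs;		/* barriers waiting for completion */
};

/* Clear the accessed bit of one modelled PTE; return 1 if it was set. */
static int model_test_and_clear_young(uint64_t *pte)
{
	int young = (*pte & MODEL_PTE_YOUNG) != 0;

	*pte &= ~MODEL_PTE_YOUNG;
	return young;
}

/* Strategy A: invalidate and synchronise after every young PTE. */
static int model_clear_flush_each(uint64_t *ptes, size_t nr,
				  struct model_stats *st)
{
	int young = 0;

	assert(ptes && st);
	for (size_t i = 0; i < nr; i++) {
		if (model_test_and_clear_young(&ptes[i])) {
			st->invalidations++;
			st->syncs++;
			young = 1;
		}
	}
	return young;
}

/* Strategy B: queue invalidations without waiting, then sync once. */
static int model_clear_flush_batched(uint64_t *ptes, size_t nr,
				     struct model_stats *st)
{
	int young = 0;

	assert(ptes && st);
	for (size_t i = 0; i < nr; i++) {
		if (model_test_and_clear_young(&ptes[i])) {
			st->invalidations++;
			young = 1;
		}
	}
	if (young)
		st->syncs++;
	return young;
}

/* Strategy C: clear the accessed bits only and never flush. */
static int model_clear_noflush(uint64_t *ptes, size_t nr,
			       struct model_stats *st)
{
	int young = 0;

	assert(ptes && st);
	for (size_t i = 0; i < nr; i++)
		young |= model_test_and_clear_young(&ptes[i]);
	return young;
}
```

For a batch where three of four PTEs are young, strategy A counts
three invalidations and three syncs, B counts three invalidations
and one sync, and C counts none of either. The model says nothing
about the real cost of each operation, which is what would have to
be measured per architecture.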

Thanks
Barry