[PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios

Baolin Wang baolin.wang at linux.alibaba.com
Sun Mar 15 23:25:15 PDT 2026



On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
> On 3/10/26 02:37, Baolin Wang wrote:
>>
>>
>> On 3/7/26 4:02 PM, Barry Song wrote:
>>> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
>>> <baolin.wang at linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Yes. In addition, this will involve many architectures’ implementations
>>>> and their differing TLB flush mechanisms, so it’s difficult to make a
>>>> reasonable per-architecture measurement. If any architecture has a more
>>>> efficient flush method, I’d prefer to implement an architecture‑specific
>>>> clear_flush_young_ptes().
>>>
>>> Right! Since TLBI is usually quite expensive, I wonder if a generic
>>> implementation for architectures lacking clear_flush_young_ptes()
>>> might benefit from something like the below (just a very rough idea):
>>>
>>> int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>                   unsigned long addr, pte_t *ptep, unsigned int nr)
>>> {
>>>           unsigned long curr_addr = addr;
>>>           int young = 0;
>>>
>>>           while (nr--) {
>>>                   young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
>>>                   ptep++;
>>>                   curr_addr += PAGE_SIZE;
>>>           }
>>>
>>>           if (young)
>>>                   flush_tlb_range(vma, addr, curr_addr);
>>>           return young;
>>> }
>>
>> I understand your point. My concern is that I can't test this patch on
>> every architecture to validate the benefits. Anyway, let me try this on
>> my x86 machine first.
> 
> In any case, please make that a follow-up patch :)

Sure. However, after looking at RISC-V and x86, I found that
ptep_clear_flush_young() does not flush the TLB at all on these
architectures. This is the x86 implementation:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

I don't have access to machines of other architectures, so I think we can
postpone this optimization unless someone is interested in optimizing the
TLB flush on an architecture that actually flushes here.
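
For reference, the difference between the two flushing strategies can be
sketched with a small userspace model. This is not kernel code and it
measures nothing: every model_* name, the page size, and the accessed-bit
position below are assumptions made up for illustration. It only counts how
many flush operations each strategy would issue. The per-PTE variant mirrors
the generic ptep_clear_flush_young() pattern (test-and-clear, then flush that
page if it was young), and the batched variant mirrors the rough idea quoted
above (clear all PTEs, then flush the range once).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: none of these are the kernel's definitions. */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PTE_YOUNG (1ULL << 5)	/* hypothetical accessed-bit position */

typedef uint64_t model_pte_t;

/* Number of simulated flush operations issued so far. */
static unsigned long model_tlb_flushes;

/* Stand-in for a TLB flush: it only counts the call. */
static void model_flush_tlb_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	model_tlb_flushes++;
}

/* Clear the accessed bit and report whether it was set. */
static bool model_test_and_clear_young(model_pte_t *ptep)
{
	bool young = (*ptep & MODEL_PTE_YOUNG) != 0;

	*ptep &= ~MODEL_PTE_YOUNG;
	return young;
}

/* Per-PTE strategy: one flush for every PTE that was young. */
static int model_clear_flush_young_each(unsigned long addr,
					model_pte_t *ptep, unsigned int nr)
{
	int young = 0;

	assert(ptep != NULL);
	for (unsigned int i = 0; i < nr; i++, addr += MODEL_PAGE_SIZE) {
		if (model_test_and_clear_young(&ptep[i])) {
			model_flush_tlb_range(addr, addr + MODEL_PAGE_SIZE);
			young = 1;
		}
	}
	return young;
}

/* Batched strategy: clear every PTE, then flush the whole range once. */
static int model_clear_flush_young_batched(unsigned long addr,
					   model_pte_t *ptep, unsigned int nr)
{
	unsigned long end = addr;
	int young = 0;

	assert(ptep != NULL);
	for (unsigned int i = 0; i < nr; i++, end += MODEL_PAGE_SIZE)
		young |= model_test_and_clear_young(&ptep[i]);

	if (young)
		model_flush_tlb_range(addr, end);
	return young;
}
```

With 16 young PTEs the per-PTE variant issues 16 simulated flushes and the
batched variant issues one. The count alone says nothing about real cost,
since a range flush may be more expensive than a single-page flush on a
given architecture, which is why per-architecture testing would still be
needed.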


