[PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
Baolin Wang
baolin.wang at linux.alibaba.com
Sun Jan 18 21:50:47 PST 2026
On 1/18/26 1:46 PM, Dev Jain wrote:
>
> On 16/01/26 7:58 pm, Barry Song wrote:
>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain at arm.com> wrote:
>>>
>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang at gmail.com> wrote:
>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>
>>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>
>>>>>>> Performance testing:
>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>>>> on my X86 machine) with this patch.
>>>>>>>
>>>>>>> W/o patch:
>>>>>>> real 0m1.018s
>>>>>>> user 0m0.000s
>>>>>>> sys 0m1.018s
>>>>>>>
>>>>>>> W/ patch:
>>>>>>> real 0m0.249s
>>>>>>> user 0m0.000s
>>>>>>> sys 0m0.249s
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts at arm.com>
>>>>>>> Acked-by: Barry Song <baohua at kernel.org>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang at linux.alibaba.com>
>>>>>>> ---
>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>> end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>> max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>
>>>>>>> - /* We only support lazyfree batching for now ... */
>>>>>>> - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>> + /* We only support lazyfree or file folios batching for now ... */
>>>>>>> + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>> return 1;
>>>>>>> +
>>>>>>> if (pte_unused(pte))
>>>>>>> return 1;
>>>>>>>
>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>> *
>>>>>>> * See Documentation/mm/mmu_notifier.rst
>>>>>>> */
>>>>>>> - dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>> + add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>> }
>>>>>>> discard:
>>>>>>> if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>> --
>>>>>>> 2.47.3
>>>>>>>
>>>>>> Hi, Baolin
>>>>>>
>>>>>> When reading your patch, I came up with one small question.
>>>>>>
>>>>>> The current try_to_unmap_one() has the following structure:
>>>>>>
>>>>>> try_to_unmap_one()
>>>>>>     while (page_vma_mapped_walk(&pvmw)) {
>>>>>>         nr_pages = folio_unmap_pte_batch()
>>>>>>
>>>>>>         if (nr_pages == folio_nr_pages(folio))
>>>>>>             goto walk_done;
>>>>>>     }
>>>>>>
>>>>>> I am wondering what happens if nr_pages > 1 but nr_pages != folio_nr_pages(folio).
>>>>>>
>>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>
>>>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>>>> skip the cleared range?
>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>>> to folio_nr_pages(folio), page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>>>>>
>>>>> take a look:
>>>>>
>>>>> next_pte:
>>>>> 	do {
>>>>> 		pvmw->address += PAGE_SIZE;
>>>>> 		if (pvmw->address >= end)
>>>>> 			return not_found(pvmw);
>>>>> 		/* Did we cross page table boundary? */
>>>>> 		if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>> 			if (pvmw->ptl) {
>>>>> 				spin_unlock(pvmw->ptl);
>>>>> 				pvmw->ptl = NULL;
>>>>> 			}
>>>>> 			pte_unmap(pvmw->pte);
>>>>> 			pvmw->pte = NULL;
>>>>> 			pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>> 			goto restart;
>>>>> 		}
>>>>> 		pvmw->pte++;
>>>>> 	} while (pte_none(ptep_get(pvmw->pte)));
>>>>>
>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>>>> will be skipped.
>>>>
>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>> */
>>>> if (nr_pages == folio_nr_pages(folio))
>>>> goto walk_done;
>>>> + else {
>>>> + pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>> + pvmw.pte += nr_pages - 1;
>>>> + }
>>>> continue;
>>>> walk_abort:
>>>> ret = false;
>>> I am of the opinion that we should do something like this. In the internal pvmw code,
>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>> is the right place. If we really want to skip certain PTEs early,
>> should we instead hint page_vma_mapped_walk()? That said, I don't
>> see much value in doing so, since in most cases nr is either 1 or
>> folio_nr_pages(folio).
>>
>>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>>> to not none, and we will lose the batching effect. I also plan to extend support to
>>> anonymous folios (therefore generalizing for all types of memory) which will set a
>>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>>> batch.
>> Thanks for catching this, Dev. I already filter out some of the more
>> complex cases, for example:
>> if (pte_unused(pte))
>> return 1;
>>
>> Since the userfaultfd write-protection case is also a corner case,
>> could we filter it out as well?
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index c86f1135222b..6bb8ba6f046e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1870,6 +1870,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> if (pte_unused(pte))
>> return 1;
>>
>> + if (userfaultfd_wp(vma))
>> + return 1;
>> +
>> return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>> }
>>
>> Just offering a second option — yours is probably better.
>
> No. This is not an edge case. This is a case which gets exposed by your work, and
> I believe that if you intend to get the file folio batching thingy in, then you
> need to fix the uffd stuff too.
Barry’s point isn’t that this is an edge case. I think he means that
uffd is not a common performance-sensitive scenario in production. Also,
we typically fall back to per-page handling for uffd cases (see
finish_fault() and alloc_anon_folio()). So I prefer to follow Barry’s
suggestion and filter out the uffd cases until we have a test case that
shows a performance improvement.
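
For reference, the batch-size decision with Barry's proposed filter can be
modelled in a few lines of userspace C. This is only a sketch, not kernel
code: struct mock_folio, struct mock_ctx and unmap_batch_size() are
hypothetical stand-ins, and the contig field simply takes the place of
whatever folio_pte_batch() would return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the folio flags the check looks at. */
struct mock_folio {
	bool anon;		/* models folio_test_anon() */
	bool swapbacked;	/* models folio_test_swapbacked() */
};

/* Hypothetical stand-in for the PTE/VMA state at the current address. */
struct mock_ctx {
	bool pte_unused;	/* models pte_unused(pte) */
	bool vma_uffd_wp;	/* models userfaultfd_wp(vma) */
	unsigned int max_nr;	/* PTEs left before the PMD or VMA end */
	unsigned int contig;	/* assumed result of folio_pte_batch() */
};

/* Return how many PTEs this model would unmap in one step. */
static unsigned int unmap_batch_size(const struct mock_folio *folio,
				     const struct mock_ctx *ctx)
{
	assert(ctx->max_nr >= 1);

	/* Only lazyfree (anon, not swapbacked) and file folios are batched. */
	if (folio->anon && folio->swapbacked)
		return 1;

	if (ctx->pte_unused)
		return 1;

	/* Proposed filter: fall back to per-page handling for uffd-wp VMAs. */
	if (ctx->vma_uffd_wp)
		return 1;

	return ctx->contig < ctx->max_nr ? ctx->contig : ctx->max_nr;
}
```

With a file folio, a 16-entry run and no uffd-wp, unmap_batch_size()
returns 16; setting vma_uffd_wp drops it to 1, which is the per-page
fallback described above.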
I also think you can continue iterating your patch[1] to support batched
unmapping for uffd VMAs, and provide data to evaluate its value.
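
As a starting point for that kind of evaluation, the concern about
non-none entries can be illustrated with a toy userspace model of the
skip loop quoted earlier in the thread. Everything here is an assumption
made for illustration: enum mock_pte, mock_clear_batch() and
mock_skip_loop_stops() are hypothetical, and the model ignores locking,
page table boundaries and whatever the walker does after the loop stops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-state model of a PTE slot. */
enum mock_pte { MOCK_NONE, MOCK_PRESENT, MOCK_MARKER };

/*
 * Clear nr entries starting at idx. With leave_markers set, each cleared
 * entry becomes a non-none marker, modelling uffd-wp armed PTEs as
 * described in the thread.
 */
static void mock_clear_batch(enum mock_pte *ptes, size_t n, size_t idx,
			     size_t nr, int leave_markers)
{
	assert(idx + nr <= n);

	for (size_t i = 0; i < nr; i++)
		ptes[idx + i] = leave_markers ? MOCK_MARKER : MOCK_NONE;
}

/*
 * Count how often a do { ... } while (entry is none) loop stops on a
 * non-none entry when walking from idx to the end of the array.
 */
static size_t mock_skip_loop_stops(const enum mock_pte *ptes, size_t n,
				   size_t idx)
{
	size_t stops = 0;

	for (;;) {
		do {
			idx++;
			if (idx >= n)
				return stops;
		} while (ptes[idx] == MOCK_NONE);
		stops++;
	}
}
```

For an 8-entry array cleared as one batch from index 0, the model stops 0
times when the entries become none and 7 times when they become markers,
so any benefit from skipping the cleared range depends on what is left
behind in the PTEs.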
[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/