[PATCH v5 0/5] variable-order, large folios for anonymous memory

Itaru Kitayama itaru.kitayama at gmail.com
Wed Aug 16 04:57:09 PDT 2023



> On Aug 16, 2023, at 18:25, Yin, Fengwei <fengwei.yin at intel.com> wrote:
> 
> 
> 
>> On 8/16/2023 4:11 PM, Itaru Kitayama wrote:
>> 
>> 
>>>> On Aug 10, 2023, at 23:29, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> This is v5 of a series to implement variable-order, large folios for anonymous
>>> memory (currently called "LARGE_ANON_FOLIO", previously called "FLEXIBLE_THP").
>>> The objective of this is to improve performance by allocating larger chunks of
>>> memory during anonymous page faults:
>>> 
>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>  pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>  and RMAP manipulation, a shorter LRU list, etc. In short, we reduce kernel
>>>  overhead. This should benefit all architectures.
>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>  advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>  speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>  TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>> 
>>> This patch set deals with the SW side of things (1). (2) is being tackled in a
>>> separate series. The new behaviour is hidden behind a new Kconfig switch,
>>> LARGE_ANON_FOLIO, which is disabled by default, although the eventual aim is to
>>> enable it by default.
>>> 
>>> My hope is that we are pretty much there with the changes at this point;
>>> hopefully this is sufficient to get an initial version merged so that we can
>>> scale up characterization efforts. The patches should not be merged until the
>>> prerequisites are complete, though; these are in progress and tracked at [5].
>>> 
>>> This series is based on mm-unstable (ad3232df3e41).
>>> 
>>> I'm going to be out on holiday from the end of today, returning on 29th
>>> August. So responses will likely be patchy, as I'm terrified of posting
>>> to list from my phone!
>>> 
>>> 
>>> Testing
>>> -------
>>> 
>>> This version adds patches to mm selftests so that the cow tests explicitly test
>>> large anon folios, in the same way that thp is tested. When enabled you should
>>> see something similar at the start of the test suite:
>>> 
>>> # [INFO] detected large anon folio size: 32 KiB
>>> 
>>> Then the following results are expected. The fails and skips are due to existing
>>> issues in mm-unstable:
>>> 
>>> # Totals: pass:207 fail:16 xfail:0 xpass:0 skip:85 error:0
>>> 
>>> Existing mm selftests reveal 1 regression in khugepaged tests when
>>> LARGE_ANON_FOLIO is enabled:
>>> 
>>> Run test: collapse_max_ptes_none (khugepaged:anon)
>>> Maybe collapse with max_ptes_none exceeded.... Fail
>>> Unexpected huge page
>>> 
>>> I believe this is because khugepaged currently skips non-order-0 pages when
>>> looking for collapse opportunities and should get fixed with the help of
>>> DavidH's work to create a mechanism to precisely determine shared vs exclusive
>>> pages.
>>> 
>>> 
>>> Changes since v4 [4]
>>> --------------------
>>> 
>>> - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
>>>   now uses the default order-3 size. I have moved this patch over to
>>>   the contpte series.
>>> - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
>>>   into series. I originally removed this at v2 to add to a separate series,
>>>   but that series has transformed significantly and it no longer fits, so
>>>   bringing it back here.
>>> - Reintroduced dependency on set_ptes(); originally dropped this at v2, but
>>>   set_ptes() is in mm-unstable now.
>>> - Updated policy for when to allocate LAF; only fall back to order-0 if
>>>   MADV_NOHUGEPAGE is present or if THP is disabled via prctl; no longer rely on
>>>   sysfs's never/madvise/always knob.
>>> - Fall back to order-0 whenever uffd is armed for the vma, not just when
>>>   uffd-wp is set on the pte.
>>> - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
>>>   with ERR_PTR().
>>> 
>>> The last 3 changes were proposed by Yu Zhao - thanks!
>>> 
>>> 
>>> Changes since v3 [3]
>>> --------------------
>>> 
>>> - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
>>> - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
>>>   sysctl is preferable but we will wait until a real workload needs it.
>>> - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
>>> - Added mm selftests for large anon folios in cow test suite.
>>> 
>>> 
>>> Changes since v2 [2]
>>> --------------------
>>> 
>>> - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
>>>     - Huang, Ying suggested the "batch zap" work (which I dropped from this
>>>       series after v1) is a prerequisite for merging FLEXIBLE_THP, so I've
>>>       moved the deferred split patch to a separate series along with the batch
>>>       zap changes. I plan to submit this series early next week.
>>> - Changed folio order fallback policy
>>>     - We no longer iterate from preferred to 0 looking for acceptable policy
>>>     - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
>>> - Removed vma parameter from arch_wants_pte_order()
>>> - Added command line parameter `flexthp_unhinted_max`
>>>     - clamps preferred order when vma hasn't explicitly opted-in to THP
>>> - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
>>>   for process or system).
>>> - Simplified implementation and integration with do_anonymous_page()
>>> - Removed dependency on set_ptes()
>>> 
>>> 
>>> Changes since v1 [1]
>>> --------------------
>>> 
>>> - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
>>> - replaced with arch-independent alloc_anon_folio()
>>>     - follows THP allocation approach
>>> - no longer retry with intermediate orders if allocation fails
>>>     - fall back directly to order-0
>>> - remove folio_add_new_anon_rmap_range() patch
>>>     - instead add its new functionality to folio_add_new_anon_rmap()
>>> - remove batch-zap pte mappings optimization patch
>>>     - remove enabler folio_remove_rmap_range() patch too
>>>     - These offer real perf improvement so will submit separately
>>> - simplify Kconfig
>>>     - single FLEXIBLE_THP option, which is independent of arch
>>>     - depends on TRANSPARENT_HUGEPAGE
>>>     - when enabled default to max anon folio size of 64K unless arch
>>>       explicitly overrides
>>> - simplify changes to do_anonymous_page():
>>>     - no more retry loop
>>> 
>>> 
>>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
>>> [2] https://lore.kernel.org/linux-mm/20230703135330.1865927-1-ryan.roberts@arm.com/
>>> [3] https://lore.kernel.org/linux-mm/20230714160407.4142030-1-ryan.roberts@arm.com/
>>> [4] https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/
>>> [5] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>> 
>>> 
>>> Thanks,
>>> Ryan
>>> 
>>> Ryan Roberts (5):
>>> mm: Allow deferred splitting of arbitrary large anon folios
>>> mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
>>> mm: LARGE_ANON_FOLIO for improved performance
>>> selftests/mm/cow: Generalize do_run_with_thp() helper
>>> selftests/mm/cow: Add large anon folio tests
>>> 
>>> include/linux/pgtable.h          |  13 ++
>>> mm/Kconfig                       |  10 ++
>>> mm/memory.c                      | 144 +++++++++++++++++--
>>> mm/rmap.c                        |  31 +++--
>>> tools/testing/selftests/mm/cow.c | 229 ++++++++++++++++++++++---------
>>> 5 files changed, 347 insertions(+), 80 deletions(-)
>>> 
>>> --
>>> 2.25.1
>>> 
>> 
>> I know Ryan is away currently, but I can’t find the base commit mentioned in the cover letter. Can anybody point me to it so I can use b4 to apply the series and test it?
>> 
> Ryan mentioned: This series is based on mm-unstable (ad3232df3e41).

I couldn’t find that commit in the mm-unstable branch I checked out today. I’m using Andrew’s mm tree for the first time in a decade, though, so I may be doing something wrong.

> 
> I believe you can apply the patchset to the latest mm-unstable.

Okay. Will try that.

Thanks,
Itaru.

> 
> 
> Regards
> Yin, Fengwei
> 
>> Thanks,
>> Itaru.


