[PATCH v8 00/10] Multi-size THP for anonymous memory

Mon Dec 4 19:37:34 PST 2023

On 12/4/23 02:20, Ryan Roberts wrote:
> Hi All,
> 
> A new week, a new version, a new name... This is v8 of a series to implement
> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
> this fares better.
> 
> The objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
> 
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> 
> This version changes the name and tidies up some of the kernel code and test
> code, based on feedback against v7 (see change log for details).

Using a couple of Armv8 systems, I've tested this patchset. I applied it
to top of tree (Linux 6.7-rc4), on top of your latest contig pte series
[1].

With those two patchsets applied, the mm selftests look OK--or at least
as OK as they normally do. I compared test runs between THP/mTHP set to
"always", vs "never", to verify that there were no new test failures.
Details: specifically, I set one particular page size (2 MB) to
"inherit", and then toggled /sys/kernel/mm/transparent_hugepage/enabled
between "always" and "never".

I also re-ran my usual compute/AI benchmark, and I'm still seeing the
same 10x performance improvement that I reported for the v6 patchset.

So for this patchset and for [1] as well, please feel free to add:

Tested-by: John Hubbard <jhubbard at nvidia.com>

[1] https://lore.kernel.org/all/20231204105440.61448-1-ryan.roberts@arm.com/

thanks,
-- 
John Hubbard
NVIDIA

> 
> By default, the existing behaviour (and performance) is maintained. The user
> must explicitly enable multi-size THP to see the performance benefit. This is
> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
> David for the suggestion)! This interface is inspired by the existing
> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
> compatibility with the existing PMD-size THP interface, and provides a base for
> future extensibility. See [8] for detailed discussion of the interface.
> 
> This series is based on mm-unstable (715b67adf4c8).
> 
> 
> Prerequisites
> =============
> 
> Some work items identified as being prerequisites are listed on page 3 at [9].
> The summary is:
> 
> | item                          | status                  |
> |:------------------------------|:------------------------|
> | mlock                         | In mainline (v6.7)      |
> | madvise                       | In mainline (v6.6)      |
> | compaction                    | v1 posted [10]          |
> | numa balancing                | Investigated: see below |
> | user-triggered page migration | In mainline (v6.7)      |
> | khugepaged collapse           | In mainline (NOP)       |
> 
> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
> John Hubbard has investigated this and concluded that it is A) not clear at the
> moment what a better policy might be for PTE-mapped THP and B) questions whether
> this should really be considered a prerequisite given no regression is caused
> for the default "multi-size THP disabled" case, and there is no correctness
> issue when it is enabled - its just a potential for non-optimal performance.
> 
> If there are no disagreements about removing numa balancing from the list (none
> were raised when I first posted this comment against v7), then that just leaves
> compaction which is in review on list at the moment.
> 
> I really would like to get this series (and its remaining comapction
> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
> point, but lets see where we get to with review?
> 
> 
> Testing
> =======
> 
> The series includes patches for mm selftests to enlighten the cow and khugepaged
> tests to explicitly test with multi-size THP, in the same way that PMD-sized
> THP is tested. The new tests all pass, and no regressions are observed in the mm
> selftest suite. I've also run my usual kernel compilation and java script
> benchmarks without any issues.
> 
> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
> THP only - they do not include the arm64 contpte follow-on series).
> 
> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
> some workloads at [11]. (Observed using v6 of this series as well as the arm64
> contpte series).
> 
> Kefeng Wang at Huawei has also indicated he sees improvements at [12] although
> there are some latency regressions also.
> 
> 
> Changes since v7 [7]
> ====================
> 
>    - Renamed "small-sized THP" -> "multi-size THP" in commit logs
>    - Added various Reviewed-by/Tested-by tags (Barry, David, Alistair)
>    - Patch 3:
>        - Fine-tuned transhuge documentation multi-size THP (JohnH)
>        - Converted hugepage_global_enabled() and hugepage_global_always() macros
>          to static inline functions (JohnH)
>        - Renamed hugepage_vma_check() to thp_vma_allowable_orders() (JohnH)
>        - Renamed transhuge_vma_suitable() to thp_vma_suitable_orders() (JohnH)
>        - Renamed "global" enabled sysfs file option to "inherit" (JohnH)
>    - Patch 9:
>        - cow selftest: Renamed param size -> thpsize (David)
>        - cow selftest: Changed test fail to assert() (David)
>        - cow selftest: Log PMD size separately from all the supported THP sizes
>          (David)
>    - Patch 10:
>        - cow selftest: No longer special case pmdsize; keep all THP sizes in
>          thpsizes[]
> 
> 
> Changes since v6 [6]
> ====================
> 
>    - Refactored vmf_pte_range_changed() to remove uffd special-case (suggested by
>      JohnH)
>    - Dropped accounting patch (#3 in v6) (suggested by DavidH)
>        - Continue to account *PMD-sized* THP only for now
>        - Can add more counters in future if needed
>        - Page cache large folios haven't needed any new counters yet
>    - Pivot to sysfs ABI proposed by DavidH
>        - per-size directories in a similar shape to that used by hugetlb
>    - Dropped "recommend" keyword patch (#6 in v6) (suggested by DavidH, Yu Zhou)
>        - For now, users need to understand implicitly which sizes are beneficial
>          to their HW/SW
>    - Dropped arch_wants_pte_order() patch (#7 in v6)
>        - No longer needed due to dropping patch "recommend" keyword patch
>    - Enlightened khugepaged mm selftest to explicitly test with small-size THP
>    - Scrubbed commit logs to use "small-sized THP" consistently (suggested by
>      DavidH)
> 
> 
> Changes since v5 [5]
> ====================
> 
>    - Added accounting for PTE-mapped THPs (patch 3)
>    - Added runtime control mechanism via sysfs as extension to THP (patch 4)
>    - Minor refactoring of alloc_anon_folio() to integrate with runtime controls
>    - Stripped out hardcoded policy for allocation order; its now all user space
>      controlled (although user space can request "recommend" which will configure
>      the HW-preferred order)
> 
> 
> Changes since v4 [4]
> ====================
> 
>    - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
>      now uses the default order-3 size. I have moved this patch over to
>      the contpte series.
>    - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
>      into series. I originally removed this at v2 to add to a separate series,
>      but that series has transformed significantly and it no longer fits, so
>      bringing it back here.
>    - Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
>      set_ptes() is in mm-unstable now.
>    - Updated policy for when to allocate LAF; only fallback to order-0 if
>      MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
>      sysfs's never/madvise/always knob.
>    - Fallback to order-0 whenever uffd is armed for the vma, not just when
>      uffd-wp is set on the pte.
>    - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
>      with ERR_PTR().
> 
>    The last 3 changes were proposed by Yu Zhao - thanks!
> 
> 
> Changes since v3 [3]
> ====================
> 
>    - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
>    - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
>      sysctl is preferable but we will wait until real workload needs it.
>    - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
>    - Added mm selftests for large anon folios in cow test suite.
> 
> 
> Changes since v2 [2]
> ====================
> 
>    - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
>        - Huang, Ying suggested the "batch zap" work (which I dropped from this
>          series after v1) is a prerequisite for merging FLXEIBLE_THP, so I've
>          moved the deferred split patch to a separate series along with the batch
>          zap changes. I plan to submit this series early next week.
>    - Changed folio order fallback policy
>        - We no longer iterate from preferred to 0 looking for acceptable policy
>        - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
>    - Removed vma parameter from arch_wants_pte_order()
>    - Added command line parameter `flexthp_unhinted_max`
>        - clamps preferred order when vma hasn't explicitly opted-in to THP
>    - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
>      for process or system).
>    - Simplified implementation and integration with do_anonymous_page()
>    - Removed dependency on set_ptes()
> 
> 
> Changes since v1 [1]
> ====================
> 
>    - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
>    - replaced with arch-independent alloc_anon_folio()
>        - follows THP allocation approach
>    - no longer retry with intermediate orders if allocation fails
>        - fallback directly to order-0
>    - remove folio_add_new_anon_rmap_range() patch
>        - instead add its new functionality to folio_add_new_anon_rmap()
>    - remove batch-zap pte mappings optimization patch
>        - remove enabler folio_remove_rmap_range() patch too
>        - These offer real perf improvement so will submit separately
>    - simplify Kconfig
>        - single FLEXIBLE_THP option, which is independent of arch
>        - depends on TRANSPARENT_HUGEPAGE
>        - when enabled default to max anon folio size of 64K unless arch
>          explicitly overrides
>    - simplify changes to do_anonymous_page():
>        - no more retry loop
> 
> 
> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20230703135330.1865927-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230714160407.4142030-1-ryan.roberts@arm.com/
> [4] https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/
> [5] https://lore.kernel.org/linux-mm/20230810142942.3169679-1-ryan.roberts@arm.com/
> [6] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/
> [7] https://lore.kernel.org/linux-mm/20231122162950.3854897-1-ryan.roberts@arm.com/
> [8] https://lore.kernel.org/linux-mm/6d89fdc9-ef55-d44e-bf12-fafff318aef8@redhat.com/
> [9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> [10] https://lore.kernel.org/linux-mm/20231113170157.280181-1-zi.yan@sent.com/
> [11] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
> [12] https://lore.kernel.org/linux-mm/479b3e2b-456d-46c1-9677-38f6c95a0be8@huawei.com/
> 
> 
> Thanks,
> Ryan
> 
> Ryan Roberts (10):
>    mm: Allow deferred splitting of arbitrary anon large folios
>    mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
>    mm: thp: Introduce multi-size THP sysfs interface
>    mm: thp: Support allocation of anonymous multi-size THP
>    selftests/mm/kugepaged: Restore thp settings at exit
>    selftests/mm: Factor out thp settings management
>    selftests/mm: Support multi-size THP interface in thp_settings
>    selftests/mm/khugepaged: Enlighten for multi-size THP
>    selftests/mm/cow: Generalize do_run_with_thp() helper
>    selftests/mm/cow: Add tests for anonymous multi-size THP
> 
>   Documentation/admin-guide/mm/transhuge.rst |  97 ++++-
>   Documentation/filesystems/proc.rst         |   6 +-
>   fs/proc/task_mmu.c                         |   3 +-
>   include/linux/huge_mm.h                    | 116 ++++--
>   mm/huge_memory.c                           | 268 ++++++++++++--
>   mm/khugepaged.c                            |  20 +-
>   mm/memory.c                                | 114 +++++-
>   mm/page_vma_mapped.c                       |   3 +-
>   mm/rmap.c                                  |  32 +-
>   tools/testing/selftests/mm/Makefile        |   4 +-
>   tools/testing/selftests/mm/cow.c           | 185 +++++++---
>   tools/testing/selftests/mm/khugepaged.c    | 410 ++++-----------------
>   tools/testing/selftests/mm/run_vmtests.sh  |   2 +
>   tools/testing/selftests/mm/thp_settings.c  | 349 ++++++++++++++++++
>   tools/testing/selftests/mm/thp_settings.h  |  80 ++++
>   15 files changed, 1177 insertions(+), 512 deletions(-)
>   create mode 100644 tools/testing/selftests/mm/thp_settings.c
>   create mode 100644 tools/testing/selftests/mm/thp_settings.h
> 
> --
> 2.25.1
>