[RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

Mon Apr 17 01:04:57 PDT 2023

On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> Hi All,
> 
> This is a second RFC and my first proper attempt at implementing variable order,
> large folios for anonymous memory. The first RFC [1] was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
> 
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
> 
>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>    overhead. This should benefit all architectures.
>  - Since we are now mapping physically contiguous chunks of memory, we can take
>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
> 
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
> 
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> it's measured. More on perf and memory below.
> 
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
> conflict resolution). I have a tree at [4].
> 
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
> 
> Approach
> ========
> 
> There are 4 fault paths that have been modified:
>  - write fault on unallocated address: do_anonymous_page()
>  - write fault on zero page: wp_page_copy()
>  - write fault on non-exclusive CoW page: wp_page_copy()
>  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
> 
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
> folio as the source, subject to constraints that may arise if the source has
> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
> the folio as we can, subject to the same mremap/munmap constraints.
> 
> If allocation of our preferred folio order fails, we gracefully fall back to
> lower orders all the way to 0.
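
To check my reading of the order selection and fallback described above, here
is a small userspace model in Python. It is not code from the series. The
names preferred_order(), alloc_with_fallback(), try_alloc and populated are
hypothetical, and the model assumes 4KB base pages and naturally aligned
folios.

```python
PAGE_SHIFT = 12   # assumption: 4KB base pages
MAX_ORDER = 4     # the hardcoded maximum quoted above


def preferred_order(addr, vma_start, vma_end, populated=frozenset(),
                    max_order=MAX_ORDER):
    """Highest order whose naturally aligned range around addr is usable.

    populated is a set of virtual page numbers that already have a PTE.
    Natural alignment is an assumption of this model; with max_order=4 it
    also keeps every candidate range inside one PMD, so there is no
    separate PMD check here.
    """
    for order in range(max_order, 0, -1):
        size = 1 << (PAGE_SHIFT + order)
        start = addr & ~(size - 1)
        end = start + size
        if start < vma_start or end > vma_end:
            continue  # range would cross a VMA boundary
        pages = range(start >> PAGE_SHIFT, end >> PAGE_SHIFT)
        if any(vpn in populated for vpn in pages):
            continue  # a neighbouring PTE is already populated
        return order
    return 0


def alloc_with_fallback(order, try_alloc):
    """Try the preferred order first, then each lower order down to 0.

    try_alloc(order) is a hypothetical allocator returning None on failure.
    """
    for candidate in range(order, -1, -1):
        folio = try_alloc(candidate)
        if folio is not None:
            return candidate, folio
    return None
```

For example, a fault at 0x13000 in a VMA starting at 0x12000 can only use
order-1 in this model, because every larger aligned range would start below
the VMA.
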
> 
> Note that none of this affects the behavior of traditional PMD-sized THP. If we
> take a fault in an MADV_HUGEPAGE region, we still get PMD-sized mappings.
> 
> Open Questions
> ==============
> 
> How to Move Forwards
> --------------------
> 
> While the series is a small-ish code change, it represents a big shift in the
> way things are done. So I'd appreciate any help in scaling up performance
> testing, review and general advice on how best to guide a change like this into
> the kernel.
> 
> Folio Allocation Order Policy
> -----------------------------
> 
> The current code is hardcoded to use a maximum order of 4. This was chosen for a
> couple of reasons:
>  - From the SW performance perspective, I see a knee around here where
>    increasing it doesn't lead to much more performance gain.
>  - Intuitively I assume that higher orders become increasingly difficult to
>    allocate.
>  - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>    "the contiguous bit" works on order-4 for 4KB base pages (although it's
>    order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>    any higher.
> 
> I suggest that ultimately setting the max order should be left to the
> architecture. arm64 would take advantage of this and set it to the order
> required for the contiguous bit for the configured base page size.
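
A quick arithmetic check of the contiguous-bit orders quoted above, written as
a Python snippet. The table only restates the numbers from the text; the names
CONT_PTE_ORDER and contiguous_span_kb() are made up for illustration.

```python
# Base page size in KB -> contiguous-bit order, as quoted in the text above.
CONT_PTE_ORDER = {4: 4, 16: 7, 64: 5}


def contiguous_span_kb(base_page_kb):
    """Size in KB of the block covered by one contiguous-bit mapping."""
    return base_page_kb << CONT_PTE_ORDER[base_page_kb]


for base_kb in sorted(CONT_PTE_ORDER):
    print(f"{base_kb}KB pages: order-{CONT_PTE_ORDER[base_kb]} "
          f"= {contiguous_span_kb(base_kb)}KB")
```

So an architecture-provided max order would correspond to a 64KB folio with
4KB pages, but to a 2MB folio with 16KB or 64KB pages.
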
> 
> However, I also have a (mild) concern about increased memory consumption. If an
> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
> we would end up allocating 16x as much memory as we used to. One potential
> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
> max allocation order for consecutive faults that extend a contiguous range, and
> decrement when discontiguous. Alternatively/additionally, we could use the VMA
> size as an indicator. I'd be interested in your thoughts/opinions.
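
One possible shape for the per-VMA tracking idea, as a Python model rather
than kernel code. The class name VmaOrderTracker and its fields are
hypothetical. The model ignores folio alignment and treats a fault as
"contiguous" only if it lands exactly at the end of the previous allocation.

```python
class VmaOrderTracker:
    """Toy per-VMA state: grow the order on contiguous faults, shrink otherwise."""

    def __init__(self, max_order=4, page_shift=12):
        self.max_order = max_order
        self.page_shift = page_shift    # assumption: 4KB base pages
        self.order = 0                  # start with base pages
        self.next_expected = None       # address just past the last allocation

    def on_fault(self, addr):
        """Return the order to allocate for a fault at addr."""
        page = (addr >> self.page_shift) << self.page_shift
        if self.next_expected is not None:
            if page == self.next_expected:
                # The fault extends the previous range: allow a larger folio.
                self.order = min(self.order + 1, self.max_order)
            else:
                # Discontiguous fault: back off towards base pages.
                self.order = max(self.order - 1, 0)
        self.next_expected = page + (1 << (self.page_shift + self.order))
        return self.order


sequential = VmaOrderTracker()
print([sequential.on_fault(a) for a in (0x0, 0x1000, 0x3000, 0x7000, 0xf000)])

sparse = VmaOrderTracker()
print([sparse.on_fault(i * 0x10000) for i in range(4)])
```

In this model a sequential toucher ramps up to order-4 within five faults,
while the "touch every 64KB" pattern stays at order-0 and so allocates no more
memory than today.
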
> 
> Deferred Split Queue Lock Contention
> ------------------------------------
> 
> The results below show that we are spending a much greater proportion of time in
> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
> 
> I think this is (at least partially) related to contention on the deferred
> split queue lock. This is a per-memcg spinlock, which means a single spinlock
> shared among all 160 CPUs. I've solved part of the problem with the last patch
> in the series (which cuts down the need to take the lock), but at folio free
> time (free_transhuge_page()), the lock is still taken and I think this could be
> a problem. Now that most anonymous pages are large folios, this lock is taken a
> lot more.
> 
> I think we could probably avoid taking the lock unless !list_empty(), but I
> haven't convinced myself it's definitely safe, so haven't applied it yet.
Yes, it's safe. We also identified other lock contention with large folios
for anonymous mappings, such as the lru lock and the zone lock. My
understanding is that anonymous pages are allocated and freed much more
frequently than page cache pages, which is why large folios for the page
cache did not expose this lock contention.

I posted the related patch to:
https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t
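
To illustrate the "skip the lock unless !list_empty()" idea, here is a toy
Python model. It is not the posted patch and it does not demonstrate why the
unlocked check is safe in the kernel. The Folio and DeferredSplitQueue classes
and the lock_acquisitions counter are hypothetical.

```python
import threading


class Folio:
    def __init__(self):
        self.on_deferred_list = False   # stands in for !list_empty()


class DeferredSplitQueue:
    def __init__(self):
        self.lock = threading.Lock()
        self.queue = set()
        self.lock_acquisitions = 0      # instrumentation for this model only

    def defer(self, folio):
        with self.lock:
            self.lock_acquisitions += 1
            self.queue.add(folio)
            folio.on_deferred_list = True

    def free(self, folio):
        # Unlocked peek: a folio that was never queued skips the lock.
        if not folio.on_deferred_list:
            return
        with self.lock:
            self.lock_acquisitions += 1
            # Re-check under the lock before touching the queue.
            if folio.on_deferred_list:
                self.queue.discard(folio)
                folio.on_deferred_list = False


q = DeferredSplitQueue()
folios = [Folio() for _ in range(100)]
q.defer(folios[0])
for f in folios:
    q.free(f)
print(q.lock_acquisitions)
```

In this model, freeing 100 folios of which only one was ever queued takes the
lock twice (once to queue, once to dequeue) instead of 101 times.
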

Regards
Yin, Fengwei

> 
> Roadmap
> =======
> 
> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
> bit" on arm64 to validate predictions about HW speedups.
> 
> I also think there are some opportunities with madvise to split folios to non-0
> orders, which might improve performance in some cases. madvise is also mistaking
> exclusive large folios for non-exclusive ones at the moment (due to the "small
> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
> frees the folio.
> 
> Results
> =======
> 
> Performance
> -----------
> 
> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
> before each run.
> 
> make defconfig && time make -jN Image
> 
> First with -j8:
> 
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |          373.0 |          342.8 |          -8.1% |
> | user-time |         2333.9 |         2275.3 |          -2.5% |
> | sys-time  |          510.7 |          340.9 |         -33.3% |
> 
> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
> for the 8 job config:
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | data abort           |     683B |      316B |         -53.8% |
> | instruction abort    |      93B |       76B |         -18.4% |
> | syscall              |     887B |      767B |         -13.6% |
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | arm64_sys_openat     |     194B |      188B |          -3.3% |
> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
> | arm64_sys_read       |     124B |      108B |         -12.7% |
> | arm64_sys_execve     |      75B |       67B |         -11.0% |
> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
> | arm64_sys_write      |      43B |       42B |          -2.9% |
> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
> | arm64_sys_clone      |      26B |       24B |         -10.0% |
> 
> And now with -j160:
> 
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |           53.7 |           48.2 |         -10.2% |
> | user-time |         2705.8 |         2842.1 |           5.0% |
> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
> 
> Above shows a 10.2% improvement in real time execution. But ~3x more time is
> spent in the kernel than for the -j8 config. I think this is related to the lock
> contention issue I highlighted above, but haven't bottomed it out yet. It's also
> not yet clear to me why user-time increases by 5%.
> 
> I've also run all the will-it-scale microbenchmarks for a single task, using the
> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
> fluctuation. So I'm just calling out tests whose results improved or
> regressed by more than 5%. Results are the average of 3 runs. Only 2 tests
> regressed:
> 
> | benchmark            | baseline | anonfolio | percent change |
> |                      | ops/s    | ops/s     | BIGGER=better  |
> | ---------------------|---------:|----------:|---------------:|
> | context_switch1.csv  |   328744 |    351150 |          6.8%  |
> | malloc1.csv          |    96214 |     50890 |        -47.1%  |
> | mmap1.csv            |   410253 |    375746 |         -8.4%  |
> | page_fault1.csv      |   624061 |   3185678 |        410.5%  |
> | page_fault2.csv      |   416483 |    557448 |         33.8%  |
> | page_fault3.csv      |   724566 |   1152726 |         59.1%  |
> | read1.csv            |  1806908 |   1905752 |          5.5%  |
> | read2.csv            |   587722 |   1942062 |        230.4%  |
> | tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
> | tlb_flush2.csv       |   266763 |    322320 |         20.8%  |
> 
> I believe malloc1 is an unrealistic test, since it does malloc/free of a 128M
> object in a loop and never touches the allocated memory. I think the malloc
> implementation is maintaining a header just before the allocated object, which
> causes a single page fault. Previously that page fault allocated 1 page. Now it
> is allocating 16 pages. This cost would be repaid if the test code wrote to the
> allocated object. Alternatively the folio allocation order policy described
> above would also solve this.
> 
> It is not clear to me why mmap1 has slowed down. This remains a todo.
> 
> Memory
> ------
> 
> I measured memory consumption while doing a kernel compile with 8 jobs on a
> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
> workload, then calculated "memory used" high and low watermarks using both
> MemFree and MemAvailable. If there is a better way of measuring system memory
> consumption, please let me know!
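
For anyone wanting to reproduce this kind of measurement, a minimal Python
sketch of the polling approach follows. It is not the script used for the
numbers in this mail. The function names are made up, and treating "4GB" as
4096MB is an assumption.

```python
import time


def parse_meminfo(text):
    """Return {field: value} for lines such as 'MemFree:  123 kB'."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def mem_used_mb(fields, total_mb, key):
    """mem-used = total - MemFree (or MemAvailable), in MB."""
    return total_mb - fields[key] / 1024


def watermarks(samples):
    """Low and high watermark of a series of mem-used samples."""
    samples = list(samples)
    return min(samples), max(samples)


def poll(duration_s, interval_s=0.5, total_mb=4096, key="MemFree",
         path="/proc/meminfo"):
    """Sample mem-used every interval_s seconds while the workload runs."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with open(path) as f:
            samples.append(mem_used_mb(parse_meminfo(f.read()), total_mb, key))
        time.sleep(interval_s)
    return watermarks(samples)
```

Running poll() once with key="MemFree" and once with key="MemAvailable" would
give the two pairs of watermarks.
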
> 
> mem-used = 4GB - /proc/meminfo:MemFree
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      825 |       842 |           2.1% |
> | mem-used-high        |     2697 |      2672 |          -0.9% |
> 
> mem-used = 4GB - /proc/meminfo:MemAvailable
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      518 |       530 |           2.3% |
> | mem-used-high        |     1522 |      1537 |           1.0% |
> 
> For the high watermark, the methods disagree; we are either saving 1% or using
> 1% more. For the low watermark, both methods agree that we are using about 2%
> more. I plan to investigate whether the proposed folio allocation order policy
> can reduce this to zero.
> 
> Thanks for making it this far!
> Ryan
> 
> 
> Ryan Roberts (17):
>   mm: Expose clear_huge_page() unconditionally
>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>   mm: Introduce try_vma_alloc_movable_folio()
>   mm: Implement folio_add_new_anon_rmap_range()
>   mm: Routines to determine max anon folio allocation order
>   mm: Allocate large folios for anonymous memory
>   mm: Allow deferred splitting of arbitrary large anon folios
>   mm: Implement folio_move_anon_rmap_range()
>   mm: Update wp_page_reuse() to operate on range of pages
>   mm: Reuse large folios for anonymous memory
>   mm: Split __wp_page_copy_user() into 2 variants
>   mm: ptep_clear_flush_range_notify() macro for batch operation
>   mm: Implement folio_remove_rmap_range()
>   mm: Copy large folios for anonymous memory
>   mm: Convert zero page to large folios on write
>   mm: mmap: Align unhinted maps to highest anon folio order
>   mm: Batch-zap large anonymous folio PTE mappings
> 
>  arch/alpha/include/asm/page.h   |   5 +-
>  arch/arm64/include/asm/page.h   |   3 +-
>  arch/arm64/mm/fault.c           |   7 +-
>  arch/ia64/include/asm/page.h    |   5 +-
>  arch/m68k/include/asm/page_no.h |   7 +-
>  arch/s390/include/asm/page.h    |   5 +-
>  arch/x86/include/asm/page.h     |   5 +-
>  include/linux/highmem.h         |  23 +-
>  include/linux/mm.h              |   8 +-
>  include/linux/mmu_notifier.h    |  31 ++
>  include/linux/rmap.h            |   6 +
>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>  mm/mmap.c                       |   4 +-
>  mm/rmap.c                       | 147 +++++-
>  14 files changed, 1000 insertions(+), 133 deletions(-)
> 
> --
> 2.25.1
>