[PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings

Mon Jul 10 05:05:19 PDT 2023

On Thu, Jun 22, 2023 at 11:00 PM Ryan Roberts <ryan.roberts at arm.com> wrote:
>
> Hi All,
>
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. It is part of a wider effort to improve performance of the 4K
> kernel with the aim of approaching the performance of the 16K kernel, but
> without breaking compatibility and without the associated increase in memory. It
> also benefits the 16K and 64K kernels by enabling 2M THP, since this is the
> contpte size for those kernels.
>
> Of course this is only one half of the change. We require the mapped physical
> memory to be the correct size and alignment for this to actually be useful (i.e.
> 64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
> problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs) will
> allocate large folios up to the PMD size today, and more filesystems are coming.
> And the other half of my work, to enable the use of large folios for anonymous
> memory, aims to make contpte sized folios prevalent for anonymous memory too.
>
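
For reference, the sizes quoted above (64K for 4K pages, 2M for 16K/64K pages)
follow from how many PTEs one contiguous-bit block spans for each granule. The
sketch below is illustrative only: the struct, table and function names are
hypothetical and are not kernel identifiers, and the per-granule entry counts
(16, 128, 32) are taken from the usual arm64 configuration rather than from
this series. The eligibility check is a simplification that assumes alignment
and length are the only requirements.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one page-size configuration. */
struct contpte_geometry {
	unsigned long long page_size;	/* base page (granule) size in bytes */
	unsigned int nr_ptes;		/* PTEs spanned by one contiguous block */
};

/* Assumed arm64 values: 4K -> 16 PTEs, 16K -> 128 PTEs, 64K -> 32 PTEs. */
const struct contpte_geometry contpte_geometries[] = {
	{ 4096ULL, 16 },
	{ 16384ULL, 128 },
	{ 65536ULL, 32 },
};

/* Size in bytes of the region covered by one contiguous block. */
unsigned long long contpte_block_size(const struct contpte_geometry *g)
{
	assert(g->page_size > 0 && g->nr_ptes > 0);
	return g->page_size * g->nr_ptes;
}

/*
 * Simplified eligibility test: both the virtual and the physical address
 * must be aligned to the block size, and the length must be a non-zero
 * multiple of it. Real code has further conditions (e.g. identical
 * permissions across the range) that are not modelled here.
 */
bool contpte_range_ok(const struct contpte_geometry *g,
		      unsigned long long va, unsigned long long pa,
		      unsigned long long len)
{
	unsigned long long block = contpte_block_size(g);

	return len != 0 && (va % block) == 0 && (pa % block) == 0 &&
	       (len % block) == 0;
}
```

With these assumed values, contpte_block_size() gives 65536 bytes for the 4K
entry and 2097152 bytes for both the 16K and 64K entries, matching the 64K and
2M figures in the cover letter.
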
>
> Dependencies
> ------------
>
> While this patch set has a complicated set of hard and soft dependencies, I
> wanted to split it out as best I could and kick off proper review
> independently.
>
> The series applies on top of these other patch sets, with a tree at:
> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v1
>
> v6.4-rc6
>   - base
>
> set_ptes()
>   - hard dependency
>   - Patch set from Matthew Wilcox to set multiple ptes with a single API call
>   - Allows arch backend to more optimally apply contpte mappings
>   - https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>
> ptep_get() pte encapsulation
>   - hard dependency
>   - Enabler series from me to ensure none of the core code ever directly
>     dereferences a pte_t that lies within a live page table.
>   - Enables gathering access/dirty bits from across the whole contpte range
>   - in mm-stable and linux-next at time of writing
>   - https://lore.kernel.org/linux-mm/d38dc237-6093-d4c5-993e-e8ffdd6cb6fa@arm.com/
>
> Report on physically contiguous memory in smaps
>   - soft dependency
>   - Enables visibility on how much memory is physically contiguous and how much
>     is contpte-mapped - useful for debug
>   - https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/
>
> Additionally there are a couple of other dependencies:
>
> anonfolio
>   - soft dependency
>   - ensures more anonymous memory is allocated in contpte-sized folios, so
>     needed to realize the performance improvements (this is the "other half"
>     mentioned above).
>   - RFC: https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
>   - Intending to post v1 shortly.
>
> exefolio
>   - soft dependency
>   - Tweak readahead to ensure executable memory is in 64K-sized folios, so
>     needed to see reduction in iTLB pressure.
>   - Don't intend to post this until we are further down the track with contpte
>     and anonfolio.
>
> Arm ARM Clarification
>   - hard dependency
>   - Current wording disallows the fork() optimization in the final patch.
>   - Arm (ATG) have proposed tightening the wording to permit it.
>   - In conversation with partners to check this wouldn't cause problems for any
>     existing HW deployments
>
> All of the _hard_ dependencies need to be resolved before this can be considered
> for merging.
>
>
> Performance
> -----------
>
> The results below show 2 benchmarks: kernel compilation and Speedometer 2.0 (a
> JavaScript benchmark running in Chromium). Both cases are running on Ampere
> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> is repeated 15 times over 5 reboots and averaged.
>
> All improvements are relative to baseline-4k. anonfolio and exefolio are as
> described above. contpte is this series. (Note that exefolio only gives an
> improvement because contpte is already in place).
>
> Kernel Compilation (smaller is better):
>
> | kernel       |   real-time |   kern-time |   user-time |
> |:-------------|------------:|------------:|------------:|
> | baseline-4k  |        0.0% |        0.0% |        0.0% |
> | anonfolio    |       -5.4% |      -46.0% |       -0.3% |
> | contpte      |       -6.8% |      -45.7% |       -2.1% |
> | exefolio     |       -8.4% |      -46.4% |       -3.7% |

Sorry, I am a bit confused. In the exefolio case, is anonfolio included,
or does it only have large contpte folios for executable code? In other words,
does the 8.4% improvement come from iTLB miss reduction only,
or from both dTLB and iTLB miss reduction?

> | baseline-16k |       -8.7% |      -49.2% |       -3.7% |
> | baseline-64k |      -10.5% |      -66.0% |       -3.5% |
>
> Speedometer 2.0 (bigger is better):
>
> | kernel       |   runs_per_min |
> |:-------------|---------------:|
> | baseline-4k  |           0.0% |
> | anonfolio    |           1.2% |
> | contpte      |           3.1% |
> | exefolio     |           4.2% |

Same question as above.

> | baseline-16k |           5.3% |
>
> I've also run Speedometer 2.0 on Pixel 6 with an Ubuntu SW stack and see similar
> gains.
>
> I've also verified that running the contpte changes without anonfolio and
> exefolio does not cause any regression vs baseline-4k.
>
>
> Opens
> -----
>
> The only potential issue that I see right now is that due to there only being 1
> access/dirty bit per contpte range, if a single page in the range is
> accessed/dirtied then all the adjacent pages are reported as accessed/dirtied
> too. Access/dirty is managed by the kernel per _folio_, so this information gets
> collapsed down anyway, and nothing changes there. However, the per _page_
> access/dirty information is reported through pagemap to user space. I'm not sure
> if this would/should be considered a break? Thoughts?
>
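
The access/dirty concern above can be illustrated with a toy model. This is
only a sketch under stated assumptions, not code from the series: the
toy_pte type, TOY_CONT_PTES and toy_ptep_get() are hypothetical names, and
the block size of 16 entries assumes 4K pages. It models a read that gathers
the access/dirty bits from across the whole contpte range, which is why a
single dirtied page makes every page in the range read back as dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CONT_PTES 16	/* assumed entries per block (4K pages) */

/* Hypothetical, heavily simplified PTE: only the bits of interest. */
struct toy_pte {
	bool cont;	/* entry is part of a contiguous block */
	bool young;	/* access flag */
	bool dirty;	/* dirty flag */
};

/*
 * Read entry idx. For a contiguous entry, OR together the access/dirty
 * bits of every entry in the naturally aligned block that contains it,
 * because the bit may have been set on any one of them. Non-contiguous
 * entries are returned unchanged.
 */
struct toy_pte toy_ptep_get(const struct toy_pte *table, size_t idx)
{
	struct toy_pte pte = table[idx];
	size_t start, i;

	if (!pte.cont)
		return pte;

	start = idx - (idx % TOY_CONT_PTES);
	for (i = start; i < start + TOY_CONT_PTES; i++) {
		pte.young |= table[i].young;
		pte.dirty |= table[i].dirty;
	}
	return pte;
}
```

In this model, marking one entry of a contiguous block dirty and then reading
any other entry of the same block through toy_ptep_get() reports dirty, while
the same read on a non-contiguous table reports only that entry's own bits.
That is the per-page information a pagemap reader would no longer be able to
distinguish.
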
> Thanks,
> Ryan

Thanks
Barry