[PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
David Hildenbrand
david at redhat.com
Mon Jan 22 11:41:49 PST 2024
Now that the rmap overhaul[1], which provides a clean interface for rmap
batching, is upstream, let's implement PTE batching during fork when
processing PTE-mapped THPs.
This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.
We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.
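To illustrate the batching idea described above, here is a minimal userspace sketch. It is not the code from this series: `struct sketch_pte`, `struct sketch_folio`, `sketch_pte_batch()` and `sketch_copy_range()` are hypothetical names, and the PTE is modelled as a plain struct rather than the kernel's `pte_t`. The `ignore_mask` parameter is an assumption that mirrors the idea of the last two patches (ignoring bits such as dirty/accessed/soft-dirty or writable when comparing PTEs).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified PTE model; not the kernel's pte_t. */
struct sketch_pte {
	uint64_t pfn;		/* page frame number this entry maps */
	uint32_t flags;		/* permission/status bits */
	bool present;
};

/* Hypothetical large folio: nr_pages physically consecutive pages. */
struct sketch_folio {
	uint64_t first_pfn;
	size_t nr_pages;
};

/*
 * Return how many entries, starting at ptes[0], map consecutive pages of
 * @folio with compatible flags. Bits set in @ignore_mask are not compared.
 * Returns 0 if ptes[0] is not present or does not map a page of @folio.
 */
static size_t sketch_pte_batch(const struct sketch_folio *folio,
			       const struct sketch_pte *ptes, size_t max_nr,
			       uint32_t ignore_mask)
{
	uint64_t folio_end = folio->first_pfn + folio->nr_pages;
	size_t nr;

	assert(folio->nr_pages > 0);
	if (max_nr == 0 || !ptes[0].present)
		return 0;
	if (ptes[0].pfn < folio->first_pfn || ptes[0].pfn >= folio_end)
		return 0;

	for (nr = 1; nr < max_nr; nr++) {
		const struct sketch_pte *pte = &ptes[nr];

		if (!pte->present)
			break;
		/* Must map the next consecutive page, still inside the folio. */
		if (pte->pfn != ptes[0].pfn + nr || pte->pfn >= folio_end)
			break;
		/* All bits we care about must match the first PTE. */
		if ((pte->flags & ~ignore_mask) != (ptes[0].flags & ~ignore_mask))
			break;
	}
	return nr;
}

/*
 * Walk a PTE range and account once per batch instead of once per PTE.
 * Adds the batch size to *refcount in a single step per batch and returns
 * the number of batches. Entries that do not start a batch are skipped.
 */
static size_t sketch_copy_range(const struct sketch_folio *folio,
				const struct sketch_pte *ptes, size_t nr_ptes,
				uint32_t ignore_mask, unsigned long *refcount)
{
	size_t i = 0, batches = 0;

	while (i < nr_ptes) {
		size_t nr = sketch_pte_batch(folio, &ptes[i], nr_ptes - i,
					     ignore_mask);

		if (nr == 0) {
			i++;
			continue;
		}
		*refcount += nr;	/* one refcount update for nr pages */
		batches++;
		i += nr;
	}
	return batches;
}
```

In this model, four PTEs mapping one 4-page folio form a single batch when their compared bits agree, so the refcount is touched once instead of four times. If two of them differ in a bit that is not covered by `ignore_mask`, the range splits into two batches.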
While this series should be beneficial for adding cont-pte support on
arm64[2], it's also one of the requirements for maintaining a total
mapcount[3] for large folios with minimal added overhead, and for further
changes[4] that build on top of the total mapcount.
Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size results in the following runtimes for fork()
(stddev < 1%; shorter is better):
Folio Size | v6.8-rc1 |      New | Change
------------------------------------------
      4KiB | 0.014328 | 0.014265 |     0%
     16KiB | 0.014263 | 0.013293 |   - 7%
     32KiB | 0.014334 | 0.012355 |   -14%
     64KiB | 0.014046 | 0.011837 |   -16%
    128KiB | 0.014011 | 0.011536 |   -18%
    256KiB | 0.013993 |  0.01134 |   -19%
    512KiB | 0.013983 | 0.011311 |   -19%
   1024KiB | 0.013986 | 0.011282 |   -19%
   2048KiB | 0.014305 | 0.011496 |   -20%
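For anyone who wants to take a comparable measurement, a rough userspace sketch follows. It is not the benchmark that produced the numbers above: `sketch_time_fork()` is a hypothetical helper that times a single fork() of a process with a populated anonymous mapping. It does not control the folio size (that would have to be configured separately), and the madvise() call is only a hint that may be ignored.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: time one fork() while `size` bytes of anonymous
 * memory are populated in the parent. Returns the elapsed time in seconds
 * as seen by the parent, or -1.0 on error.
 */
static double sketch_time_fork(size_t size)
{
	struct timespec start, end;
	pid_t pid;
	char *mem;

	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1.0;

	/* Only a hint; the result is deliberately not checked. */
	madvise(mem, size, MADV_HUGEPAGE);

	/* Touch every byte so the page tables are populated before fork(). */
	memset(mem, 1, size);

	clock_gettime(CLOCK_MONOTONIC, &start);
	pid = fork();
	if (pid == 0)
		_exit(0);	/* child: exit immediately */
	clock_gettime(CLOCK_MONOTONIC, &end);

	if (pid < 0) {
		munmap(mem, size);
		return -1.0;
	}

	waitpid(pid, NULL, 0);
	munmap(mem, size);

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

A single call is noisy, so a real comparison would repeat the measurement many times with the same mapping size (1GiB in the table above) and report the mean and standard deviation.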
Next up is PTE batching when unmapping, which I'll probably send out
based on this series this/next week.
Only tested on x86-64. Compile-tested on most other architectures. Will
do more testing and double-check the arch changes while this is getting
some review.
[1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
Cc: Andrew Morton <akpm at linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy at infradead.org>
Cc: Ryan Roberts <ryan.roberts at arm.com>
Cc: Russell King <linux at armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas at arm.com>
Cc: Will Deacon <will at kernel.org>
Cc: Dinh Nguyen <dinguyen at kernel.org>
Cc: Michael Ellerman <mpe at ellerman.id.au>
Cc: Nicholas Piggin <npiggin at gmail.com>
Cc: Christophe Leroy <christophe.leroy at csgroup.eu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar at kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao at linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley at sifive.com>
Cc: Palmer Dabbelt <palmer at dabbelt.com>
Cc: Albert Ou <aou at eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev at linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer at linux.ibm.com>
Cc: Heiko Carstens <hca at linux.ibm.com>
Cc: Vasily Gorbik <gor at linux.ibm.com>
Cc: Christian Borntraeger <borntraeger at linux.ibm.com>
Cc: Sven Schnelle <svens at linux.ibm.com>
Cc: "David S. Miller" <davem at davemloft.net>
Cc: linux-arm-kernel at lists.infradead.org
Cc: linuxppc-dev at lists.ozlabs.org
Cc: linux-riscv at lists.infradead.org
Cc: linux-s390 at vger.kernel.org
Cc: sparclinux at vger.kernel.org
David Hildenbrand (11):
arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
nios2/pgtable: define PFN_PTE_SHIFT
powerpc/pgtable: define PFN_PTE_SHIFT
risc: pgtable: define PFN_PTE_SHIFT
s390/pgtable: define PFN_PTE_SHIFT
sparc/pgtable: define PFN_PTE_SHIFT
mm/memory: factor out copying the actual PTE in copy_present_pte()
mm/memory: pass PTE to copy_present_pte()
mm/memory: optimize fork() with PTE-mapped THP
mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
mm/memory: ignore writable bit in folio_pte_batch()
arch/arm/include/asm/pgtable.h | 2 +
arch/arm64/include/asm/pgtable.h | 2 +
arch/nios2/include/asm/pgtable.h | 2 +
arch/powerpc/include/asm/pgtable.h | 2 +
arch/riscv/include/asm/pgtable.h | 2 +
arch/s390/include/asm/pgtable.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 2 +
include/linux/pgtable.h | 17 ++-
mm/memory.c | 188 +++++++++++++++++++++-------
9 files changed, 173 insertions(+), 46 deletions(-)
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
--
2.43.0