[RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
Yang Shi
yang at os.amperecomputing.com
Wed Apr 29 10:04:28 PDT 2026
Introduction
============
This patch series implements the LSFMM 2026 proposal for optimizing
this_cpu_*() ops on ARM64. For the details of the proposal, please refer to:
https://lore.kernel.org/linux-mm/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/
I didn't repeat it in the cover letter because there is no change to the
proposal.
The series is based on 7.1-rc1. It is basically a minimum viable set of
patches. There are still a few hacks in this series and it may break some
things, for example KPTI, SMT machines which share a TLB, etc. But it should
be good enough for now to demonstrate the core idea. The main purpose of the
RFC is to gather feedback early, figure out missing parts and risks, and
make sure we are on the right track. Hopefully it can also help the
discussion at the upcoming LSFMM.
I broke the patches down into arch-dependent and arch-independent parts so
that interested people can more easily experiment on other architectures,
for example S390.
A new kernel config is introduced, HAVE_LOCAL_PER_CPU_MAP. The architectures
which can support this feature will select it. Allocating and freeing percpu
local mapping is protected by this config so that others won't pay the cost.
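As a rough illustration of how a config symbol can keep the cost away from
architectures that do not select it, here is a userspace sketch of the usual
stub pattern. Only the name HAVE_LOCAL_PER_CPU_MAP comes from this series; the
demo_* names and the struct are hypothetical and are not taken from the
patches.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for the Kconfig symbol. Build with
 * -DCONFIG_HAVE_LOCAL_PER_CPU_MAP=1 to model an arch that selects it.
 */
#ifndef CONFIG_HAVE_LOCAL_PER_CPU_MAP
#define CONFIG_HAVE_LOCAL_PER_CPU_MAP 0
#endif

/* Hypothetical chunk descriptor, for illustration only. */
struct demo_chunk {
	size_t size;
	int has_local_map;
};

#if CONFIG_HAVE_LOCAL_PER_CPU_MAP
/* Arch selected the feature: set up and tear down the local mapping. */
static int demo_alloc_local_map(struct demo_chunk *c)
{
	c->has_local_map = 1;
	return 0;
}

static void demo_free_local_map(struct demo_chunk *c)
{
	c->has_local_map = 0;
}
#else
/* Feature not selected: empty stubs, so other archs pay nothing. */
static inline int demo_alloc_local_map(struct demo_chunk *c)
{
	(void)c;
	return 0;
}

static inline void demo_free_local_map(struct demo_chunk *c)
{
	(void)c;
}
#endif

static int demo_chunk_create(struct demo_chunk *c, size_t size)
{
	assert(c != NULL);
	c->size = size;
	c->has_local_map = 0;
	return demo_alloc_local_map(c);
}

static void demo_chunk_destroy(struct demo_chunk *c)
{
	demo_free_local_map(c);
	c->size = 0;
}
```

The callers stay identical in both configurations; only the helpers change.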
Known Issues
============
1. KPTI
-------
We need to determine which CPU we are on, then switch to the right page table.
Currently the arm64 kernel fetches tramp_pg_dir via swapper_pg_dir - fixed_offset,
and fetches swapper_pg_dir from ttbr1. But ttbr1 may no longer hold
swapper_pg_dir except on CPU #0, so we need to figure out another way to
handle it.
Switching to tramp_pg_dir should be easy, but the reverse seems harder because
tramp_pg_dir just maps the trampoline vectors.
Maybe we can do a two-step switch: switch to swapper_pg_dir first, then
switch to the per-CPU page table (for entry) or the tramp page table (for exit).
Nobody should call this_cpu_*() at either userspace -> kernel entry stage or
kernel -> userspace exit stage.
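To make the two-step idea concrete, here is a small userspace model of the
transitions. It is only a sketch of the ordering described above, not arm64
entry code: the enum, the struct and the function names are all hypothetical,
and the rule that a direct tramp <-> per-CPU switch is rejected is an
assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which page table the modelled CPU currently has installed in ttbr1. */
enum pgd_kind { PGD_TRAMP, PGD_SWAPPER, PGD_PER_CPU };

struct cpu_model {
	int id;
	enum pgd_kind ttbr1;
};

/* In this model, this_cpu_*() is only usable on the per-CPU table. */
static bool this_cpu_ops_allowed(const struct cpu_model *cpu)
{
	return cpu->ttbr1 == PGD_PER_CPU;
}

/* Model assumption: every switch must go through swapper_pg_dir. */
static bool switch_pgd(struct cpu_model *cpu, enum pgd_kind next)
{
	bool direct;

	assert(cpu != NULL);
	direct = (cpu->ttbr1 == PGD_TRAMP && next == PGD_PER_CPU) ||
		 (cpu->ttbr1 == PGD_PER_CPU && next == PGD_TRAMP);
	if (direct)
		return false;
	cpu->ttbr1 = next;
	return true;
}

/* userspace -> kernel: tramp -> swapper -> per-CPU */
static bool model_kernel_entry(struct cpu_model *cpu)
{
	return switch_pgd(cpu, PGD_SWAPPER) && switch_pgd(cpu, PGD_PER_CPU);
}

/* kernel -> userspace: per-CPU -> swapper -> tramp */
static bool model_kernel_exit(struct cpu_model *cpu)
{
	return switch_pgd(cpu, PGD_SWAPPER) && switch_pgd(cpu, PGD_TRAMP);
}
```

Between the two steps the model reports this_cpu_ops_allowed() as false,
which matches the expectation that nobody calls this_cpu_*() in the entry or
exit stage.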
2. Shared TLB machines
----------------------
Some machines may share a TLB between CPUs, for example, SMT machines may share
a TLB between the two hardware threads of a single core.
The per-CPU page table just can't work on them. Maybe we need a new
cpufeature to indicate whether the per-CPU page table is allowed or not, then
only enable it on machines that do not share a TLB.
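The kind of check such a cpufeature would encode could look roughly like the
userspace sketch below. The struct, its tlb_domain field and the function name
are hypothetical; how the kernel would actually discover TLB sharing is not
covered by this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU description; the field names are illustrative. */
struct cpu_desc {
	int cpu;
	int tlb_domain;	/* CPUs with the same value share one TLB */
};

/* Allow per-CPU page tables only if no two CPUs share a TLB. */
static bool per_cpu_pgtable_allowed(const struct cpu_desc *cpus, size_t n)
{
	assert(cpus != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (cpus[i].tlb_domain == cpus[j].tlb_domain)
				return false;
	return true;
}
```

For example, two cores with private TLBs pass the check, while two SMT
threads placed in the same tlb_domain fail it and would keep using the
existing this_cpu_*() path.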
Benchmark
=========
The benchmarks were run on a 160-core AmpereOne machine. The baseline is the
v7.1-rc1 kernel.
1. Kernel Build
---------------
Run kernel build (make -j160) with the default Fedora kernel config in a
memcg.
13% - 18% sys time improvement
3% - 7% wall time improvement
2. stress-ng vm ops
-------------------
stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
8.5% improvement
3. stress-ng vm ops + fork
--------------------------
stress-ng --mmapfork 160 --mmapfork-bytes 128M --mmapfork-ops 500
15% improvement
Regression test
===============
1. memcg creation
-----------------
Create 10K memcgs. Each memcg creation needs to allocate multiple percpu
variables, for example, percpu refcnt, rstat and objcg percpu refcnt.
Consumed 2112K more virtual memory for the percpu “local mapping” and a few
more megabytes for the per-CPU page tables.
No noticeable regression was found for elapsed time.
2. fork test
------------
stress-ng --fork 160 --fork-ops 10000000
fork() needs to allocate multiple percpu variables, for example, rss
counters and mm_cid_cpu.
Roughly 1% regression was found. However, the stress-ng fork test has quite a
small address space, while real-life workloads typically have much larger
address spaces and do more complicated work. The stress-ng mmapfork
benchmark saw a 15% improvement.
Yang Shi (11):
arm64: mm: enable percpu kernel page table
arm64: mm: define percpu virtual space area
arm64: smp: define setup_per_cpu_areas()
mm: percpu: prepare to use dedicated percpu area
arm64: mm: map local percpu first chunk
mm: percpu: set up first chunk and reserve chunk
arm64: mm: introduce __per_cpu_local_off
vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush()
mm: percpu: allocate and free local percpu vm area
arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP
arm64: percpu: use local percpu for this_cpu_*() APIs
arch/arm64/Kconfig | 2 +-
arch/arm64/include/asm/mmu.h | 3 +++
arch/arm64/include/asm/mmu_context.h | 6 +++++-
arch/arm64/include/asm/percpu.h | 17 ++++++++++-------
arch/arm64/include/asm/pgtable.h | 24 +++++++++++++++++++++---
arch/arm64/kernel/setup.c | 3 +++
arch/arm64/kernel/smp.c | 40 ++++++++++++++++++++++++++++++++++++++++
arch/arm64/mm/mmu.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
arch/arm64/mm/ptdump.c | 4 ++++
drivers/base/arch_numa.c | 51 +--------------------------------------------------
include/linux/percpu.h | 4 +++-
include/linux/vmalloc.h | 3 +++
mm/Kconfig | 3 +++
mm/internal.h | 5 ++++-
mm/kmsan/hooks.c | 14 +++++++-------
mm/percpu-internal.h | 15 +++++++++++++++
mm/percpu-vm.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/percpu.c | 46 +++++++++++++++++++++++++++++++++++++---------
mm/vmalloc.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
19 files changed, 419 insertions(+), 99 deletions(-)
Thanks,
Yang