[RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
Yang Shi
yang at os.amperecomputing.com
Wed May 13 17:00:19 PDT 2026
On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
>> =========
>> The benchmarks are done on 160 core AmpereOne machine. The baseline is
>> v7.1-rc1 kernel.
>>
>> 1. Kernel Build
>> ---------------
>> Run kernel build (make -j160) with the default Fedora kernel config in a
>> memcg.
>> 13% - 18% sys time improvement
>> 3% - 7% wall time improvement
> This is pretty impressive!
Thank you.
>
> There was quite some feedback during the LSF/MM session, what's the current plan?
We didn't get to the plan in the LSF/MM session because time ran out.
I had a hallway conversation with Ryan. He said he will try to
replicate the performance benchmarks on some other ARM64 machines.
He raised a concern about CNP (Common not Private), but neither he nor
I could find a machine with a shared TLB. We do need some help running
the patchset on such machines because disabling CNP may have
performance implications.
I plan to polish up the patchset. There is still a lot of work to do to
get it into better shape. Does that sound like a plan?
I'm not sure whether the S390 folks will implement this on S390 or not,
but they are cc'ed anyway.
>
> Also, it was raised that Linus so far didn't enjoy per-process page tables. Is
> there a way forward?
Yeah, it was discussed. My point is that it makes some sense for x86 not
to have per-CPU page tables, because userspace and the kernel share the
same page table on x86, so the number of kernel page tables would
effectively be unbounded. But ARM64 is different. The hardware supports
separate userspace and kernel page tables, so the number of kernel page
tables is bounded by the number of CPUs. And my regression tests didn't
show a noticeable regression from setting up the percpu local mapping
for 160 cores (meaning 160 kernel page tables).
So we should take full advantage of the hardware IMHO. And it should be
up to the architecture maintainers.
>
>
> Finally, in the LSF/MM session, there was the question why the preemption
> handling is even required. Can you describe what the problem is?
Someone asked why we can't just remove preempt_disable/enable, since we
only care about the sum of the counters. That may be OK for some cases,
for example simple statistics, but it may cause problems for a lot of
use cases, for example:
- __this_cpu_*() ops don't use atomic instructions. If they happen
to access the same counter concurrently with this_cpu_*() ops, the
counter may be corrupted.
- this_cpu_write() may write a value or a pointer, so it may corrupt
the remote CPU's copy.
- The percpu counter may call into the slow path to flush the per-CPU
counters to a global counter once some threshold is reached. An
imprecise per-CPU counter may result in suboptimal behavior, for
example, calling into the slow path more often than necessary.
- The statistics may get out of sync, or deviate more than expected,
because the counter flush is skipped when the threshold is compared
against the wrong value.
- AFAIK, the scheduler may use percpu counters for some percpu locks;
an imprecise counter may cause lockups and misbehavior.
- Some subsystems maintain percpu state and then make decisions
based on it. Corrupted percpu state may cause various problems.
- this_cpu_cmpxchg() may compare against the remote CPU's value and
end up looping indefinitely.
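To make the first case concrete, here is a minimal userspace sketch. It
is not kernel code and not taken from the patchset: counter[] and
replay_lost_update() are hypothetical names, and the function simply
replays, in order, one unlucky interleaving of two non-atomic
increments after the first task has been migrated away from CPU 0.

```c
#define NR_CPUS 2

/* Hypothetical stand-in for a per-CPU counter: one slot per CPU. */
static long counter[NR_CPUS];

/*
 * Replay one unlucky interleaving of two non-atomic increments that
 * both end up targeting CPU 0's slot. Returns the final value of
 * that slot.
 */
static long replay_lost_update(void)
{
	long *slot_a, *slot_b;
	long tmp_a, tmp_b;

	counter[0] = 0;
	counter[1] = 0;

	/* Task A, running on CPU 0: resolve the local slot and load it. */
	slot_a = &counter[0];
	tmp_a = *slot_a;		/* reads 0 */

	/* Task A is preempted here and migrated to CPU 1. */

	/* Task B, now running on CPU 0: a complete load/add/store update. */
	slot_b = &counter[0];
	tmp_b = *slot_b;		/* reads 0 */
	*slot_b = tmp_b + 1;		/* CPU 0's slot is now 1 */

	/* Task A resumes on CPU 1 but still stores to CPU 0's slot. */
	*slot_a = tmp_a + 1;		/* stores 1 again, B's update is lost */

	return counter[0];
}
```

Two increments were issued, but replay_lost_update() returns 1, and
CPU 1's slot was never touched. Keeping task A on CPU 0 between
resolving the slot and storing to it rules out this interleaving, which
is what the preemption handling is there for.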
There are a lot of other cases that I may not be aware of, because percpu
is widely used by various subsystems. Anyway, the spec is that
this_cpu_*() ops can only access the local CPU's copy. Accessing a remote
CPU's data is definitely not expected and may cause various problems.
Thanks,
Yang
>