[RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

Wed May 13 17:00:19 PDT 2026

On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
>> =========
>> The benchmarks are done on 160 core AmpereOne machine. The baseline is
>> v7.1-rc1 kernel.
>>
>> 1. Kernel Build
>> ---------------
>> Run kernel build (make -j160) with the default Fedora kernel config in a
>> memcg.
>> 13% - 18% sys time improvement
>> 3% - 7% wall time improvement
> This is pretty impressive!

Thank you.

>
> There was quite some feedback during the LSF/MM session, what's the current plan?

We didn't get to the plan in the LSF/MM session because time ran out.
I had a hallway conversation with Ryan. He said he will try to
replicate the performance benchmarks on some other ARM64 machines.

He raised a concern about CNP (Common not Private), but neither he nor
I could find a machine with a shared TLB. We do need some help running
the patchset on such machines because disabling CNP may have
performance implications.

I plan to polish up the patchset. There is still a lot of work to do
to get it into better shape. Sounds like a plan?

I'm not sure whether the S390 folks will implement this on S390, but
they are cc'ed anyway.

>
> Also, it was raised that Linus so far didn't enjoy per-process page tables. Is
> there a way forward?

Yeah, it was discussed. My point is that it makes some sense for x86
not to have per-CPU page tables because userspace and the kernel share
the same page table on x86, so the number of kernel page tables is
actually unbounded. But ARM64 is different. The hardware supports
separate userspace and kernel page tables, so the number of kernel page
tables is bounded by the number of CPUs. And my regression tests didn't
show a noticeable regression from setting up the percpu local mapping
for 160 cores (which means 160 kernel page tables).

So we should maximize the hardware benefit IMHO. And it should be up to 
the architecture maintainers.

>
>
> Finally, in the LSF/MM session, there was the question why the preemption
> handling is even required. Can you describe what the problem is?

Someone asked why we don't just remove preempt_disable/enable, since we
only care about the sum of the counters. That may be OK for some cases,
for example simple statistics, but it may cause problems for a lot of
use cases, for example:
     - __this_cpu_*() ops don't use atomic instructions. If they access
       the same counter concurrently with this_cpu_*() ops, the counter
       may be corrupted.
     - this_cpu_write() may write a value or a pointer, so it may
       corrupt the remote CPU's copy.
     - A percpu counter may call into the slow path to flush the per-CPU
       counters to a global counter once some threshold is reached. An
       imprecise per-CPU counter may result in suboptimal behavior, for
       example calling into the slow path more often than necessary.
     - The statistics may go out of sync or deviate more than expected
       because the counter flush is skipped when the threshold is
       compared with the wrong value.
     - AFAIK, the scheduler may use a percpu counter for some percpu
       lock, and an imprecise counter may cause lockups and misbehavior.
     - Some subsystems maintain percpu state and make decisions based on
       it. Corrupted percpu state may cause various problems.
     - this_cpu_cmpxchg() may compare against the remote CPU's value and
       end up looping indefinitely.
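
To make the first point concrete, here is a minimal userspace sketch
(not kernel code) that replays one fixed interleaving of two non-atomic
increments. It assumes the generic pattern where the per-CPU address is
resolved first and the load/add/store follows. struct percpu_model and
all helper names are hypothetical and exist only for this illustration.

```c
#define NR_CPUS 2

/* Hypothetical model: one plain, non-atomic counter per CPU. */
struct percpu_model {
	long counter[NR_CPUS];
};

/* Step 1 of a per-CPU op: resolve the address of the given CPU's copy. */
static long *resolve_local(struct percpu_model *m, int cpu)
{
	return &m->counter[cpu];
}

static long sum_counters(const struct percpu_model *m)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += m->counter[cpu];
	return sum;
}

/*
 * No preemption protection: task A is migrated between resolving the
 * address and doing the read-modify-write. Returns the sum of all
 * counters after two increments were attempted.
 */
long run_unprotected(struct percpu_model *m)
{
	/* Task A runs on CPU 0 and resolves CPU 0's counter. */
	long *a_ptr = resolve_local(m, 0);

	/* Task A is preempted and migrated to CPU 1; a_ptr is now remote. */

	/* Task B runs on CPU 0 and starts its own non-atomic increment. */
	long *b_ptr = resolve_local(m, 0);
	long b_tmp = *b_ptr;

	/* Task A resumes on CPU 1 and updates CPU 0's counter remotely. */
	long a_tmp = *a_ptr;
	*a_ptr = a_tmp + 1;

	/* Task B stores its stale value + 1 and overwrites A's update. */
	*b_ptr = b_tmp + 1;

	return sum_counters(m);
}

/*
 * Preemption disabled around each op: every load/add/store sequence
 * completes on the CPU where its address was resolved.
 */
long run_protected(struct percpu_model *m)
{
	/* Task A: whole sequence on CPU 0. */
	long *a_ptr = resolve_local(m, 0);
	*a_ptr = *a_ptr + 1;

	/* Task B: runs on CPU 0 only after A has finished. */
	long *b_ptr = resolve_local(m, 0);
	*b_ptr = *b_ptr + 1;

	return sum_counters(m);
}
```

With both counters starting at zero, run_unprotected() returns 1 even
though two increments were issued, while run_protected() returns 2. So
even the "we only care about the sum" argument does not hold once a
plain read-modify-write can land on a remote CPU's copy.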

There are a lot of other cases that I may not be aware of because
percpu is widely used by various subsystems. Anyway, the spec is that
this_cpu_*() ops can only access the local CPU's copy. Accessing a
remote CPU's data is definitely not expected and may cause various
problems.

Thanks,
Yang
