Overhead of arm64 LSE per-CPU atomics?
Paul E. McKenney
paulmck at kernel.org
Fri Oct 31 12:39:41 PDT 2025
On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > optimized variants of SRCU readers that use per-CPU atomics. This works
> > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
>
> That's quite a difference. Does it get any better if
> CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> on the kernel command line.
In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
Yes, this gets me more than an order of magnitude improvement, and
about 30% better than my workaround of disabling interrupts around a
non-atomic increment of those counters (sketched below), thank you!
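For reference, here is a rough sketch of that workaround; the counter
and function names are illustrative, not the actual SRCU code:

#include <linux/irqflags.h>
#include <linux/percpu.h>

/* Illustrative per-CPU counter standing in for the real SRCU counter. */
static DEFINE_PER_CPU(unsigned long, demo_srcu_ctr);

/* Non-atomic per-CPU increment, made safe by disabling interrupts. */
static inline void demo_ctr_inc(void)
{
	unsigned long flags;

	local_irq_save(flags);
	__this_cpu_inc(demo_srcu_ctr);	/* Plain load/add/store, no LSE. */
	local_irq_restore(flags);
}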
Given that per-CPU atomics are usually not heavily contended, would it
make sense to avoid LSE in that case?
And I need to figure out whether I should recommend that Meta build
its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n. Any advice you
might have would be deeply appreciated! (I am of course also following
up internally.)
> Depending on the implementation and configuration, the LSE atomics may
> skip the L1 cache and be executed closer to the memory (they used to be
> called far atomics). The CPUs try to be smarter, for example doing the
> operation "near" if the line is already in the cache, but the heuristics
> may not always work.
My knowledge-free guess is that it is early days for LSE, and that it
therefore has significant hardware-level optimization work ahead of it.
For example, I well recall being roundly denounced by Intel engineers in
my neighborhood for reporting similar performance results on Pentium 4
back in the day. The truth might well have set them free, but it sure
didn't make them happy! ;-)
But what would a non-knowledge-free guess be?
> Interestingly, we had this patch recently to force a prefetch before the
> atomic:
>
> https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
>
> We rejected it but I wonder whether it improves the SRCU scenario.
No statistically significant difference on my system. This is a 72-CPU
Neoverse V2, in case that matters. Here are my results for the underlying
this_cpu_inc() and this_cpu_dec() pair of operations:
Nanoseconds per this_cpu_inc()/this_cpu_dec() pair:

                                 LSE Atomics          LSE Atomics
                                 Enabled (Stock)      Disabled
Without Yicong's Patch (Stock)       110.786              9.852
With Yicong's Patch                  109.873              9.853
As you can see, disabling LSE buys about an order of magnitude,
and Yicong's patch has no statistically significant effect.
This and more can be found in the "Per-CPU Increment/Decrement"
section of this Google document:
https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
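For concreteness, here is a rough sketch of the sort of loop being
timed; this is illustrative only, not the actual test harness behind
the numbers above:

#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned long, bench_ctr);

/* Return average nanoseconds per this_cpu_inc()/this_cpu_dec() pair. */
static u64 bench_pair_ns(unsigned long iters)
{
	unsigned long i;
	u64 t0, t1;

	preempt_disable();
	t0 = ktime_get_ns();
	for (i = 0; i < iters; i++) {
		this_cpu_inc(bench_ctr);	/* LL/SC or LSE, per Kconfig. */
		this_cpu_dec(bench_ctr);
	}
	t1 = ktime_get_ns();
	preempt_enable();

	return div64_u64(t1 - t0, iters);
}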
Full disclosure: Calls to srcu_read_lock_fast() followed by
srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
this_cpu_inc(), but I am not seeing any difference between the two.
And testing the underlying primitives allows my tests to give reproducible
results regardless of what state I have the SRCU code in. ;-)
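To make that shape concrete, a reader pair is roughly the following,
with illustrative names rather than the real srcu_data layout:

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, demo_srcu_locks);
static DEFINE_PER_CPU(unsigned long, demo_srcu_unlocks);

static inline void demo_read_lock_fast(void)
{
	this_cpu_inc(demo_srcu_locks);		/* Counted at lock time. */
}

static inline void demo_read_unlock_fast(void)
{
	this_cpu_inc(demo_srcu_unlocks);	/* Counted at unlock time. */
}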
Thoughts?
Thanx, Paul