Overhead of arm64 LSE per-CPU atomics?

Catalin Marinas catalin.marinas at arm.com
Fri Oct 31 11:30:31 PDT 2025


On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> To make event tracing safe for PREEMPT_RT kernels, I have been creating
> optimized variants of SRCU readers that use per-CPU atomics.  This works
> quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).

That's quite a difference. Does it get any better if
CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
on the kernel command line.
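
For reference, the two code paths differ roughly as below (a simplified
sketch, not the kernel's actual atomic_lse.h/atomic_ll_sc.h macros): with
LSE the add is a single instruction that the CPU may execute "far", while
the fallback is an exclusive load/store loop that works on the line held
in the local cache:

#include <linux/types.h>

/* Simplified sketch only, not the kernel's real implementation. */
static inline void lse_add(u64 *p, u64 v)
{
        /* Single LSE instruction; may be executed "far", bypassing L1. */
        asm volatile("stadd %x[v], %[m]"
                     : [m] "+Q" (*p)
                     : [v] "r" (v));
}

static inline void llsc_add(u64 *p, u64 v)
{
        u64 tmp;
        u32 res;

        /* LL/SC loop; operates on the cache line held locally. */
        asm volatile(
        "1:     ldxr    %[tmp], %[m]\n"
        "       add     %[tmp], %[tmp], %x[v]\n"
        "       stxr    %w[res], %[tmp], %[m]\n"
        "       cbnz    %w[res], 1b"
        : [res] "=&r" (res), [tmp] "=&r" (tmp), [m] "+Q" (*p)
        : [v] "r" (v));
}

With CONFIG_ARM64_LSE_ATOMICS=y the kernel selects between the two at
boot via alternatives, so disabling the option forces the LL/SC form
everywhere.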

Depending on the implementation and configuration, the LSE atomics may
skip the L1 cache and be executed closer to memory (they used to be
called far atomics). The CPUs try to be smarter, e.g. performing the
operation "near" if the line is already in the cache, but the heuristics
may not always work.

Interestingly, we had this patch recently to force a prefetch before the
atomic:

https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/

We rejected it, but I wonder whether it improves the SRCU scenario.
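
If it helps, the idea there (illustrated below as a hypothetical sketch,
not the actual patch) is simply to issue a write prefetch so that the
line is already in L1 when the LSE atomic executes, making it more
likely the CPU performs the operation "near":

#include <linux/prefetch.h>
#include <linux/types.h>

/* Hypothetical illustration of prefetch-before-atomic, not the patch. */
static inline void prefetched_lse_add(u64 *p, u64 v)
{
        prefetchw(p);                   /* PRFM PSTL1KEEP on arm64 */
        asm volatile("stadd %x[v], %[m]"
                     : [m] "+Q" (*p)
                     : [v] "r" (v));
}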

-- 
Catalin
