Overhead of arm64 LSE per-CPU atomics?
Paul E. McKenney
paulmck at kernel.org
Fri Oct 31 15:21:38 PDT 2025
On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics. This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
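
[ For concreteness, a minimal sketch of the two operations being compared
above.  The per-CPU variable and helper names are illustrative only, not
the actual SRCU code: ]

#include <linux/atomic.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, nr_readers);	/* hypothetical counter */
static atomic_t foo = ATOMIC_INIT(0);

static void compare_increments(void)
{
	/*
	 * Per-CPU atomic increment: with CONFIG_ARM64_USE_LSE_ATOMICS this
	 * is runtime-patched to a single LSE atomic add, otherwise it is
	 * an LDXR/STXR loop (see the percpu.h code in the patch below).
	 */
	this_cpu_inc(nr_readers);

	/*
	 * Plain read-modify-write of an atomic_t: ordinary load and store,
	 * no atomic read-modify-write instruction at all.
	 */
	atomic_set(&foo, atomic_read(&foo) + 1);
}
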
> >
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
>
> In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
>
> Yes, this gets me more than an order of magnitude improvement, and about
> 30% better than my workaround of disabling interrupts around a non-atomic
> increment of those counters, thank you!
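
[ For reference, the interrupt-disabling workaround mentioned above looks
roughly like the following; the counter name is illustrative, not the
actual SRCU field: ]

#include <linux/irqflags.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, nr_readers);	/* hypothetical counter */

static void increment_without_atomics(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* exclude local interrupt handlers */
	__this_cpu_inc(nr_readers);	/* plain load/add/store, no atomic RMW */
	local_irq_restore(flags);
}
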
>
> Given that per-CPU atomics are usually not heavily contended, would it
> make sense to avoid LSE in that case?
For example, how about something like the patch below?
Thanx, Paul
------------------------------------------------------------------------
commit 0c0b71d19c997915c5ef5fe7e32eb56b4e4a750e
Author: Paul E. McKenney <paulmckrcu at fb.com>
Date: Fri Oct 31 14:14:13 2025 -0700
arm64: Separately select LSE for per-CPU atomics
LSE atomics provide better scalability, but not always better single-CPU
performance. In fact, on the ARM Neoverse V2, they degrade single-CPU
performance by an order of magnitude, from about 5ns per operation to
about 50ns.
Now per-CPU atomics are rarely contended; in fact, a given per-CPU
variable is usually used mostly by the CPU in question. This means
that LSE's better scalability does not help, but its degraded single-CPU
performance does hurt.
Therefore, provide a new default-n ARM64_USE_LSE_PERCPU_ATOMICS Kconfig
option that, when enabled, uses LSE for per-CPU atomics. This means
that default kernel builds will use non-LSE atomics for this case, but
will still use LSE atomics for the global atomic variables that are
more likely to be heavily contended, and thus more likely to benefit
from LSE.
Signed-off-by: Paul E. McKenney <paulmckrcu at fb.com>
Cc: Catalin Marinas <catalin.marinas at arm.com>
Cc: Will Deacon <will at kernel.org>
Cc: <linux-arm-kernel at lists.infradead.org>
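
[ Not part of the patch: a rough sketch of what the two alternatives in
the percpu.h code below boil down to for an 8-byte per-CPU add, with the
instruction sequences shown as comments.  The helper name is made up for
illustration: ]

static inline void percpu_add_sketch(unsigned long *ptr, unsigned long val)
{
	/*
	 * LL/SC flavor (ARM64_USE_LSE_PERCPU_ATOMICS=n), roughly:
	 *
	 *	1: ldxr	x1, [x0]
	 *	   add	x1, x1, x2
	 *	   stxr	w3, x1, [x0]
	 *	   cbnz	w3, 1b
	 *
	 * LSE flavor (ARM64_USE_LSE_PERCPU_ATOMICS=y, LSE present at
	 * runtime), roughly:
	 *
	 *	   stadd	x2, [x0]
	 *
	 * The single LSE instruction is what scales under heavy contention,
	 * but on the Neoverse V2 measurements above it is the slower of the
	 * two for the uncontended per-CPU case.
	 */
	*ptr += val;	/* placeholder body; the real code is the inline asm */
}
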
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 58b782779138..b91b7cbe4569 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1927,6 +1927,21 @@ config ARM64_USE_LSE_ATOMICS
atomic routines. This incurs a small overhead on CPUs that do
not support these instructions.
+config ARM64_USE_LSE_PERCPU_ATOMICS
+ bool "LSE for per-CPU atomic instructions"
+ default n
+ help
+ As part of the Large System Extensions, ARMv8.1 introduces new
+ atomic instructions that are designed specifically to scale in
+ very large systems. However, per-CPU atomics are rarely contended
+ by design, so they usually benefit more from the higher uncontended
+ performance of the LL/SC routines than from LSE's better scalability
+ under heavy contention.
+
+ Say Y here to make use of these instructions for the in-kernel
+ per-CPU atomic routines. This incurs a small overhead on CPUs
+ that do not support these instructions.
+
endmenu # "ARMv8.1 architectural features"
menu "ARMv8.2 architectural features"
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h
index 3129a5819d0e..2d5eff217d63 100644
--- a/arch/arm64/include/asm/lse.h
+++ b/arch/arm64/include/asm/lse.h
@@ -26,12 +26,19 @@
/* In-line patching at runtime */
#define ARM64_LSE_ATOMIC_INSN(llsc, lse) \
ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#if IS_ENABLED(CONFIG_ARM64_USE_LSE_PERCPU_ATOMICS)
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse) \
+ ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#else
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse) llsc
+#endif
#else /* CONFIG_ARM64_LSE_ATOMICS */
#define __lse_ll_sc_body(op, ...) __ll_sc_##op(__VA_ARGS__)
#define ARM64_LSE_ATOMIC_INSN(llsc, lse) llsc
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse) llsc
#endif /* CONFIG_ARM64_LSE_ATOMICS */
#endif /* __ASM_LSE_H */
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..eaa3c2f87407 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -70,7 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
unsigned int loop; \
u##sz tmp; \
\
- asm volatile (ARM64_LSE_ATOMIC_INSN( \
+ asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN( \
/* LL/SC */ \
"1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
#op_llsc "\t%" #w "[tmp], %" #w "[tmp], %" #w "[val]\n" \
@@ -91,7 +91,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
unsigned int loop; \
u##sz ret; \
\
- asm volatile (ARM64_LSE_ATOMIC_INSN( \
+ asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN( \
/* LL/SC */ \
"1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
#op_llsc "\t%" #w "[ret], %" #w "[ret], %" #w "[val]\n" \
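
[ Not part of the patch: a hypothetical smoke test, roughly how one might
compare per-CPU increment cost in kernels built with
ARM64_USE_LSE_PERCPU_ATOMICS set to y and to n.  Module and symbol names
are made up: ]

#include <linux/module.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/timekeeping.h>

static DEFINE_PER_CPU(unsigned long, test_counter);

static int __init lse_percpu_test_init(void)
{
	enum { NR_OPS = 1000000 };
	u64 t0, t1;
	int i;

	preempt_disable();		/* stay on one CPU for the measurement */
	t0 = ktime_get_ns();
	for (i = 0; i < NR_OPS; i++)
		this_cpu_inc(test_counter);
	t1 = ktime_get_ns();
	preempt_enable();

	pr_info("per-CPU inc: ~%llu ns/op\n", (t1 - t0) / NR_OPS);
	return 0;
}
module_init(lse_percpu_test_init);

MODULE_LICENSE("GPL");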