Overhead of arm64 LSE per-CPU atomics?
Paul E. McKenney
paulmck at kernel.org
Fri Oct 31 15:21:38 PDT 2025
On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics. This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
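
[ For concreteness, a minimal sketch of the two operations being compared
above.  The per-CPU variable and helper names are illustrative only, not
the actual SRCU code: ]

#include <linux/atomic.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, nr_readers);	/* hypothetical counter */
static atomic_t foo = ATOMIC_INIT(0);

static void compare_increments(void)
{
	/*
	 * Per-CPU atomic increment: with CONFIG_ARM64_USE_LSE_ATOMICS this
	 * is runtime-patched to a single LSE atomic add, otherwise it is
	 * an LDXR/STXR loop (see the percpu.h code in the patch below).
	 */
	this_cpu_inc(nr_readers);

	/*
	 * Plain read-modify-write of an atomic_t: ordinary load and store,
	 * no atomic read-modify-write instruction at all.
	 */
	atomic_set(&foo, atomic_read(&foo) + 1);
}
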
> >
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
>
> In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
>
> Yes, this gets me more than an order of magnitude improvement, and about
> 30% better than my workaround of disabling interrupts around a non-atomic
> increment of those counters, thank you!
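
[ For reference, the interrupt-disabling workaround mentioned above looks
roughly like the following; the counter name is illustrative, not the
actual SRCU field: ]

#include <linux/irqflags.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, nr_readers);	/* hypothetical counter */

static void increment_without_atomics(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* exclude local interrupt handlers */
	__this_cpu_inc(nr_readers);	/* plain load/add/store, no atomic RMW */
	local_irq_restore(flags);
}
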
>
> Given that per-CPU atomics are usually not heavily contended, would it
> make sense to avoid LSE in that case?
For example, how about something like the patch below?
Thanx, Paul
------------------------------------------------------------------------
commit 0c0b71d19c997915c5ef5fe7e32eb56b4e4a750e
Author: Paul E. McKenney <paulmckrcu at fb.com>
Date: Fri Oct 31 14:14:13 2025 -0700
arm64: Separately select LSE for per-CPU atomics
LSE atomics provide better scalability, but not always better single-CPU
performance. In fact, on the ARM Neoverse V2, they degrade single-CPU
performance by an order of magnitude, from about 5ns per operation to
about 50ns.
Now per-CPU atomics are rarely contended; in fact, a given per-CPU
variable is usually used mostly by the CPU in question. This means
that LSE's better scalability does not help, but its degraded single-CPU
performance does hurt.
Therefore, provide a new default-n ARM64_USE_LSE_PERCPU_ATOMICS Kconfig
option that, when enabled, uses LSE for per-CPU atomics. This means
that default kernel builds will use non-LSE atomics for this case, but
will still use LSE atomics for the global atomic variables that are
more likely to be heavily contended, and thus more likely to benefit
from LSE.
Signed-off-by: Paul E. McKenney <paulmckrcu at fb.com>
Cc: Catalin Marinas <catalin.marinas at arm.com>
Cc: Will Deacon <will at kernel.org>
Cc: <linux-arm-kernel at lists.infradead.org>
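
[ Not part of the patch: a rough sketch of what the two alternatives in
the percpu.h code below boil down to for an 8-byte per-CPU add, with the
instruction sequences shown as comments.  The helper name is made up for
illustration: ]

static inline void percpu_add_sketch(unsigned long *ptr, unsigned long val)
{
	/*
	 * LL/SC flavor (ARM64_USE_LSE_PERCPU_ATOMICS=n), roughly:
	 *
	 *	1: ldxr	x1, [x0]
	 *	   add	x1, x1, x2
	 *	   stxr	w3, x1, [x0]
	 *	   cbnz	w3, 1b
	 *
	 * LSE flavor (ARM64_USE_LSE_PERCPU_ATOMICS=y, LSE present at
	 * runtime), roughly:
	 *
	 *	   stadd	x2, [x0]
	 *
	 * The single LSE instruction is what scales under heavy contention,
	 * but on the Neoverse V2 measurements above it is the slower of the
	 * two for the uncontended per-CPU case.
	 */
	*ptr += val;	/* placeholder body; the real code is the inline asm */
}
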
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 58b782779138..b91b7cbe4569 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1927,6 +1927,21 @@ config ARM64_USE_LSE_ATOMICS
atomic routines. This incurs a small overhead on CPUs that do
not support these instructions.
+config ARM64_USE_LSE_PERCPU_ATOMICS
+ bool "LSE for per-CPU atomic instructions"
+ default n
+ help
+ As part of the Large System Extensions, ARMv8.1 introduces new
+ atomic instructions that are designed specifically to scale in
+ very large systems. However, per-CPU atomics are rarely contended
+ by design, so they usually benefit more from the higher uncontended
+ performance of the LL/SC routines than from LSE's better scalability
+ under heavy contention.
+
+ Say Y here to make use of these instructions for the in-kernel
+ per-CPU atomic routines. This incurs a small overhead on CPUs
+ that do not support these instructions.
+
endmenu # "ARMv8.1 architectural features"
menu "ARMv8.2 architectural features"
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h
index 3129a5819d0e..2d5eff217d63 100644
--- a/arch/arm64/include/asm/lse.h
+++ b/arch/arm64/include/asm/lse.h
@@ -26,12 +26,19 @@
/* In-line patching at runtime */
#define ARM64_LSE_ATOMIC_INSN(llsc, lse) \
ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#if IS_ENABLED(CONFIG_ARM64_USE_LSE_PERCPU_ATOMICS)
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse) \
+ ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#else
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse) llsc
+#endif
#else /* CONFIG_ARM64_LSE_ATOMICS */
#define __lse_ll_sc_body(op, ...) __ll_sc_##op(__VA_ARGS__)
#define ARM64_LSE_ATOMIC_INSN(llsc, lse) llsc
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse) llsc
#endif /* CONFIG_ARM64_LSE_ATOMICS */
#endif /* __ASM_LSE_H */
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..eaa3c2f87407 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -70,7 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
unsigned int loop; \
u##sz tmp; \
\
- asm volatile (ARM64_LSE_ATOMIC_INSN( \
+ asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN( \
/* LL/SC */ \
"1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
#op_llsc "\t%" #w "[tmp], %" #w "[tmp], %" #w "[val]\n" \
@@ -91,7 +91,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
unsigned int loop; \
u##sz ret; \
\
- asm volatile (ARM64_LSE_ATOMIC_INSN( \
+ asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN( \
/* LL/SC */ \
"1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
#op_llsc "\t%" #w "[ret], %" #w "[ret], %" #w "[val]\n" \
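
[ Not part of the patch: a hypothetical smoke test, roughly how one might
compare per-CPU increment cost in kernels built with
ARM64_USE_LSE_PERCPU_ATOMICS set to y and to n.  Module and symbol names
are made up: ]

#include <linux/module.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/timekeeping.h>

static DEFINE_PER_CPU(unsigned long, test_counter);

static int __init lse_percpu_test_init(void)
{
	enum { NR_OPS = 1000000 };
	u64 t0, t1;
	int i;

	preempt_disable();		/* stay on one CPU for the measurement */
	t0 = ktime_get_ns();
	for (i = 0; i < NR_OPS; i++)
		this_cpu_inc(test_counter);
	t1 = ktime_get_ns();
	preempt_enable();

	pr_info("per-CPU inc: ~%llu ns/op\n", (t1 - t0) / NR_OPS);
	return 0;
}
module_init(lse_percpu_test_init);

MODULE_LICENSE("GPL");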