Overhead of arm64 LSE per-CPU atomics?
Catalin Marinas
catalin.marinas at arm.com
Tue Nov 4 09:05:02 PST 2025
On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > index 9abcc8ef3087..e381034324e1 100644
> > > --- a/arch/arm64/include/asm/percpu.h
> > > +++ b/arch/arm64/include/asm/percpu.h
> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
> > > unsigned int loop; \
> > > u##sz tmp; \
> > > \
> > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); \
> > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > /* LL/SC */ \
> > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
> > > unsigned int loop; \
> > > u##sz ret; \
> > > \
> > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); \
> > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > /* LL/SC */ \
> > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
> > > -----------------8<------------------------
> >
> > I will give this a shot, thank you!
>
> Jackpot!!!
>
> This reduces the overhead to 8.427, which is significantly better than
> the non-LSE value of 9.853. Still room for improvement, but much
> better than the 100ns values.
>
> I presume that you will send this up the normal path, but in the meantime,
> I will pull this in for further local testing, and thank you!

After an educational discussion with the microarchitects, I think the
hardware is behaving as intended, it just doesn't always fit the
software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
Linux as a STADD instruction (that's LDADD with XZR as the destination,
i.e. no need to return the value read from memory). This is typically
executed "far" or posted (unless it hits in the L1 cache) and is
intended for stat updates. At a quick grep, that matches the majority of
the use-cases in Linux. Most other atomics (those with a return value)
are executed "near", so they pull the cache line in (assuming the
default CPUECTLR configuration).
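
To illustrate, a rough sketch (not from any kernel file) of the two
shapes, assuming LSE atomics are patched in; the register numbers in the
comments are only indicative:

#include <linux/percpu.h>

/* Value discarded: the atomic becomes STADD (LDADD with XZR as the
 * destination), which the CPU may execute "far"/posted. */
void stat_inc(unsigned long __percpu *cnt)
{
	this_cpu_add(*cnt, 1);			/* -> stadd x1, [x0] */
}

/* Value consumed: the atomic becomes LDADD, executed "near", pulling
 * the cache line into L1. */
unsigned long counted_inc(unsigned long __percpu *cnt)
{
	return this_cpu_add_return(*cnt, 1);	/* -> ldadd x1, x2, [x0] */
}
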
For the SRCU case, executing the STADD "far" does slow things down,
especially together with the DMB after lock and before unlock. A
microbenchmark doing this in a loop looks a lot worse than it would in
practice, as it saturates the buses down the path to memory.
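
For the non-fast __srcu_read_lock()/__srcu_read_unlock() path, the shape
under discussion is roughly the following (again only a sketch, assuming
LSE atomics and CONFIG_NEED_SRCU_NMI_SAFE=n; the demo_* names are made
up):

#include <linux/percpu.h>
#include <asm/barrier.h>

static DEFINE_PER_CPU(unsigned long, demo_ctr);

void demo_srcu_lock_shape(void)
{
	this_cpu_inc(demo_ctr);	/* -> stadd, may be executed "far" */
	smp_mb();		/* -> dmb ish, right behind it */
}

void demo_srcu_unlock_shape(void)
{
	smp_mb();		/* -> dmb ish */
	this_cpu_inc(demo_ctr);	/* -> stadd */
}
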
A quick test to check this theory, if those are the functions you were
benchmarking (it generates LDADD instead of STADD):
---------------------8<----------------------------------------
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 42098e0fa0b7..5a6f3999883d 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
 	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
 
 	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
-		this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
+		this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
 	else
 		atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); // Y, and implicit RCU reader.
 	barrier(); /* Avoid leaking the critical section. */
@@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
 {
 	barrier(); /* Avoid leaking the critical section. */
 	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
-		this_cpu_inc(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
+		this_cpu_inc_return(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
 	else
 		atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks)); // Z, and implicit RCU reader.
 }
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 1ff94b76d91f..c025d9135689 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
 {
 	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
 
-	this_cpu_inc(scp->srcu_locks.counter);
+	this_cpu_inc_return(scp->srcu_locks.counter);
 	smp_mb(); /* B */ /* Avoid leaking the critical section. */
 	return __srcu_ptr_to_ctr(ssp, scp);
 }
@@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
 {
 	smp_mb(); /* C */ /* Avoid leaking the critical section. */
-	this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
+	this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
 
---------------------8<----------------------------------------

To make things better for the non-fast variants above, we should add
this_cpu_inc_return_acquire() etc. (strangely, this_cpu_inc_return()
doesn't have full barrier semantics the way atomic_inc_return() does).

I'm not sure about adding the prefetch since most other uses of
this_cpu_add() are meant for stat updates and there's not much point in
bringing in a cache line. I think we could add release/acquire variants
that generate LDADDA/LDADDL, and maybe a slightly different API for the
__srcu_*_fast() helpers - or use a new this_cpu_add_return_relaxed() if
we add full barrier semantics to the current _return() variants.
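
For example, such an acquire flavour could look roughly like this on the
arm64 side (a hypothetical sketch only, loosely modelled on the existing
templates in arch/arm64/include/asm/percpu.h; the function name is made
up, and the LL/SC fallback, alternative patching and the preempt-safe
wrappers are omitted):

#include <linux/types.h>

/* LDADDA returns the old value and has acquire semantics, so it is
 * executed "near" and could replace the STADD + DMB pair on the lock
 * side. */
static inline u64 __percpu_add_return_acquire_case_64(void *ptr,
						       unsigned long val)
{
	u64 ret;

	asm volatile("ldadda	%[val], %[ret], %[ptr]"
		     : [ret] "=&r" (ret), [ptr] "+Q" (*(u64 *)ptr)
		     : [val] "r" ((u64)val)
		     : "memory");

	return ret + val;	/* a _return() op yields the new value */
}
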
--
Catalin