Overhead of arm64 LSE per-CPU atomics?
Paul E. McKenney
paulmck at kernel.org
Tue Nov 4 12:10:36 PST 2025
On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> > On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > > index 9abcc8ef3087..e381034324e1 100644
> > > > > --- a/arch/arm64/include/asm/percpu.h
> > > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
> > > > > unsigned int loop; \
> > > > > u##sz tmp; \
> > > > > \
> > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > > > /* LL/SC */ \
> > > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
> > > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
> > > > > unsigned int loop; \
> > > > > u##sz ret; \
> > > > > \
> > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > > > /* LL/SC */ \
> > > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
> > > > > -----------------8<------------------------
> > > >
> > > > I will give this a shot, thank you!
> > >
> > > Jackpot!!!
> > >
> > > This reduces the overhead to 8.427ns, which is significantly better
> > > than the non-LSE value of 9.853ns. Still room for improvement, but
> > > much better than the 100ns values.
> > >
> > > I presume that you will send this up the normal path, but in the meantime,
> > > I will pull this in for further local testing, and thank you!
> >
> > After an educative discussion with the microarchitects, I think the
> > hardware is behaving as intended; it just doesn't always fit the
> > software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
> > Linux as a STADD instruction (that's LDADD with XZR as destination; i.e.
> > no need to return the value read from memory). This is typically
> > executed as "far" or posted (unless it hits in the L1 cache) and
> > intended for stat updates. At a quick grep, it matches the majority of
> > the use-cases in Linux. Most other atomics (those with a return) are
> > executed "near", so filling the cache lines (assuming default CPUECTLR
> > configuration).
>
> OK...
>
> > For the SRCU case, executing STADD far does slow things down, especially
> > together with the DMB after lock and before unlock. A microbenchmark
> > doing this in a loop looks a lot worse than it would in practice
> > (it saturates the buses down the path to memory).
>
> In this srcu_read_lock_fast_updown() case, there was no DMB. But for
> srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
> (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
>
> > A quick test to check this theory, if those are the functions you were
> > benchmarking (it generates LDADD instead):
>
> Thank you for digging into this!

And this_cpu_inc_return() does speed things up on my hardware to about
the same extent as did the prefetch instruction, so thank you again.
However, it gets me more than a 4x slowdown on x86, so I cannot make
this change in common code.
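
If I am reading the near/far distinction correctly, the difference at
the instruction level is roughly the following.  This is only an
illustrative sketch (the helper names are made up, and this is not the
kernel's actual percpu code), and it assumes an LSE-capable toolchain,
for example -march=armv8.1-a:

static inline void lse_add_posted(unsigned long *p, unsigned long v)
{
	/*
	 * No destination register, so this assembles to STADD (the
	 * alias for LDADD with XZR), which the CPU may execute
	 * "far"/posted unless the line is already in L1.
	 */
	asm volatile("stadd	%[v], %[p]"
		     : [p] "+Q" (*p)
		     : [v] "r" (v));
}

static inline unsigned long lse_add_return_near(unsigned long *p,
						unsigned long v)
{
	unsigned long old;

	/*
	 * Asking for the old value forces LDADD proper, which is
	 * executed "near" and pulls the line into the L1 cache
	 * (given the default CPUECTLR configuration).
	 */
	asm volatile("ldadd	%[v], %[old], %[p]"
		     : [p] "+Q" (*p), [old] "=r" (old)
		     : [v] "r" (v));
	return old + v;
}

I have not checked the generated code on x86, but my guess is that the
4x hit there comes from this_cpu_inc_return() ending up as an xadd on
%gs rather than a plain increment.
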
So, my thought is to push an arm64-only this_cpu_inc_return() into SRCU
via something like this_cpu_inc_srcu(), though not for the upcoming merge
window but rather the one after that, sticking with my current
interrupt-disabling non-atomic approach in the meantime (which gets me
most of the benefit).
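
For concreteness, what I have in mind is no more than the following
untested sketch, with __srcu_read_lock_fast() and friends switched
over to it:

#ifdef CONFIG_ARM64
/* Force the near (LDADD) form so the counter's cache line comes local. */
#define this_cpu_inc_srcu(pcp)	((void)this_cpu_inc_return(pcp))
#else
/* Everyone else keeps the cheaper no-return increment. */
#define this_cpu_inc_srcu(pcp)	this_cpu_inc(pcp)
#endif
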
Alternatively, would it work for me to put that cache-prefetch instruction
into SRCU for arm64? My guess is "absolutely not!", but I figured that
I should ask.
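
For the record, what I am asking about is roughly the following
untested sketch (srcu_prefetch_ctr() is a made-up name), which
__srcu_read_lock_fast() and friends would call on
raw_cpu_ptr(&scp->srcu_locks) just before the increment:

/* Sketch of an SRCU-private prefetch, a no-op except on arm64. */
static inline void srcu_prefetch_ctr(void *ptr)
{
#ifdef CONFIG_ARM64
	asm volatile("prfm	pstl1strm, %a0\n" : : "p" (ptr));
#endif
}
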
But if both of these approaches prove problematic, I might need some
way to distinguish between systems having slow LSE and those that do not.
Or I can stick with disabling interrupts across non-atomic updates.

Thoughts?

							Thanx, Paul
> > ---------------------8<----------------------------------------
> > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> > index 42098e0fa0b7..5a6f3999883d 100644
> > --- a/include/linux/srcutree.h
> > +++ b/include/linux/srcutree.h
> > @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
> > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
> >
> > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> > - this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> > + this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> > else
> > atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); // Y, and implicit RCU reader.
> > barrier(); /* Avoid leaking the critical section. */
> > @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
> > {
> > barrier(); /* Avoid leaking the critical section. */
> > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> > - this_cpu_inc(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
> > + this_cpu_inc_return(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
> > else
> > atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks)); // Z, and implicit RCU reader.
> > }
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 1ff94b76d91f..c025d9135689 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
> > {
> > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
> >
> > - this_cpu_inc(scp->srcu_locks.counter);
> > + this_cpu_inc_return(scp->srcu_locks.counter);
> > smp_mb(); /* B */ /* Avoid leaking the critical section. */
> > return __srcu_ptr_to_ctr(ssp, scp);
> > }
> > @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> > void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
> > {
> > smp_mb(); /* C */ /* Avoid leaking the critical section. */
> > - this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> > + this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> > }
> > EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >
> > ---------------------8<----------------------------------------
> >
> > To make things better for the non-fast variants above, we should add
> > this_cpu_inc_return_acquire() etc. semantics (strangely,
> > this_cpu_inc_return() doesn't have the full barrier semantics that
> > atomic_inc_return() has).
> >
> > I'm not sure about adding the prefetch since most other uses of
> > this_cpu_add() are meant for stat updates and there's not much point in
> > bringing in a cache line. I think we could add release/acquire variants
> > that generate LDADDA/L and maybe a slightly different API for the
> > __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add
> > full barrier semantics to the current _return() variants.
>
> But other architectures might well have this_cpu_inc_return() running
> more slowly than this_cpu_inc(). So my thought would be to make a
> this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and
> this_cpu_inc() elsewhere.
>
> I could imagine this_cpu_inc_local() or some such, but it is not clear
> that the added API explosion is yet justified.
>
> Or is there a better way?
>
> Thanx, Paul