Overhead of arm64 LSE per-CPU atomics?

Paul E. McKenney paulmck at kernel.org
Tue Nov 4 10:43:02 PST 2025


On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > index 9abcc8ef3087..e381034324e1 100644
> > > > --- a/arch/arm64/include/asm/percpu.h
> > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > > >  	unsigned int loop;						\
> > > >  	u##sz tmp;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > > >  	unsigned int loop;						\
> > > >  	u##sz ret;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > > -----------------8<------------------------
> > > 
> > > I will give this a shot, thank you!
> > 
> > Jackpot!!!
> > 
> > This reduces the overhead to 8.427ns, which is significantly better
> > than the non-LSE value of 9.853ns.  Still room for improvement, but
> > much better than the 100ns values.
> > 
> > I presume that you will send this up the normal path, but in the meantime,
> > I will pull this in for further local testing, and thank you!
> 
> After an educative discussion with the microarchitects, I think the
> hardware is behaving as intended; it just doesn't always fit the
> software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
> Linux as a STADD instruction (that's LDADD with XZR as the destination,
> i.e. no need to return the value read from memory). This is typically
> executed "far", or posted (unless it hits in the L1 cache), and is
> intended for stat updates. At a quick grep, that matches the majority
> of the use-cases in Linux. Most other atomics (those with a return
> value) are executed "near", so they fill the cache line (assuming the
> default CPUECTLR configuration).

OK...
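
If I am following the distinction correctly, here is a rough userspace
sketch of the two instruction forms (illustrative only, not the kernel's
percpu machinery, and assuming an ARMv8.1+ CPU and toolchain, for
example "gcc -march=armv8.1-a"):

/*
 * Standalone illustration of STADD vs. LDADD, nothing more.
 */
#include <stdio.h>

static unsigned long counter;

/*
 * this_cpu_add()-style: no return value, so STADD (LDADD with XZR as
 * the destination) suffices, and the CPU may execute it "far" without
 * pulling the line into the L1 cache.
 */
static inline void add_noreturn(unsigned long *p, unsigned long i)
{
	asm volatile("stadd	%[i], %[v]"
		     : [v] "+Q" (*p)
		     : [i] "r" (i));
}

/*
 * this_cpu_add_return()-style: the old value is needed, so LDADD gets a
 * real destination register and the CPU executes it "near", filling the
 * cache line.
 */
static inline unsigned long add_return(unsigned long *p, unsigned long i)
{
	unsigned long old;

	asm volatile("ldadd	%[i], %[old], %[v]"
		     : [v] "+Q" (*p), [old] "=r" (old)
		     : [i] "r" (i));
	return old + i;
}

int main(void)
{
	add_noreturn(&counter, 1);
	printf("counter after add_return: %lu\n", add_return(&counter, 1));
	return 0;
}

So, if I understand correctly, the only way for software to ask for the
"near" form is to consume the returned value, which is what your patch
below does.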

> For the SRCU case, executing STADD far does slow things down,
> especially together with the DMB after lock and before unlock. A
> microbenchmark doing this in a loop looks a lot worse than it would in
> practice (it saturates the buses down the path to memory).

In this srcu_read_lock_fast_updown() case, there was no DMB.  But for
srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
(The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)

> A quick test to check this theory, if those are the functions you were
> benchmarking (it generates LDADD instead):

Thank you for digging into this!

> ---------------------8<----------------------------------------
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index 42098e0fa0b7..5a6f3999883d 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
>  	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
>  
>  	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> -		this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> +		this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
>  	else
>  		atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks));  // Y, and implicit RCU reader.
>  	barrier(); /* Avoid leaking the critical section. */
> @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
>  {
>  	barrier();  /* Avoid leaking the critical section. */
>  	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> -		this_cpu_inc(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
> +		this_cpu_inc_return(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
>  	else
>  		atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks));  // Z, and implicit RCU reader.
>  }
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 1ff94b76d91f..c025d9135689 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
>  {
>  	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
>  
> -	this_cpu_inc(scp->srcu_locks.counter);
> +	this_cpu_inc_return(scp->srcu_locks.counter);
>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>  	return __srcu_ptr_to_ctr(ssp, scp);
>  }
> @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
>  void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
>  {
>  	smp_mb(); /* C */  /* Avoid leaking the critical section. */
> -	this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> +	this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
>  }
>  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>  
> ---------------------8<----------------------------------------
> 
> To make things better for the non-fast variants above, we should add
> this_cpu_inc_return_acquire() etc. semantics (strangely,
> this_cpu_inc_return() doesn't have the full barrier semantics that
> atomic_inc_return() has).
> 
> I'm not sure about adding the prefetch, since most other uses of
> this_cpu_add() are meant for stat updates and there's not much point in
> bringing in a cache line. I think we could add release/acquire variants
> that generate LDADDA/L, and maybe a slightly different API for the
> __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add
> full barrier semantics to the current _return() variants.
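
And for the acquire flavour, I am guessing something along the lines of
the sketch above but with LDADDA (again illustrative only, and whether
acquire ordering really suffices to replace the smp_mb() after the
increment in __srcu_read_lock() would of course need careful checking):

/*
 * Hypothetical acquire form, same caveats as the earlier sketch.
 * LDADDA keeps later accesses from being reordered before the
 * increment's load.
 */
static inline unsigned long add_return_acquire(unsigned long *p, unsigned long i)
{
	unsigned long old;

	asm volatile("ldadda	%[i], %[old], %[v]"
		     : [v] "+Q" (*p), [old] "=r" (old)
		     : [i] "r" (i)
		     : "memory");
	return old + i;
}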

But other architectures might well have this_cpu_inc_return() running
more slowly than this_cpu_inc().  So my thought would be to make a
this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and
this_cpu_inc() elsewhere.
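
Roughly like this, with both the name and the CONFIG_ARM64 test being
purely illustrative:

/* Hypothetical wrapper, name and arch test for illustration only. */
#ifdef CONFIG_ARM64
/*
 * Use the _return form so that arm64 generates LDADD with a real
 * destination ("near") rather than STADD, pulling the line into L1.
 */
#define this_cpu_inc_srcu(pcp)	this_cpu_inc_return(pcp)
#else
/* Elsewhere, the plain increment is likely the cheaper choice. */
#define this_cpu_inc_srcu(pcp)	this_cpu_inc(pcp)
#endif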

I could imagine this_cpu_inc_local() or some such, but it is not clear
that the added API explosion is yet justified.

Or is there a better way?

							Thanx, Paul


