Overhead of arm64 LSE per-CPU atomics?
Paul E. McKenney
paulmck at kernel.org
Tue Nov 4 10:43:02 PST 2025
On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > index 9abcc8ef3087..e381034324e1 100644
> > > > --- a/arch/arm64/include/asm/percpu.h
> > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
> > > > unsigned int loop; \
> > > > u##sz tmp; \
> > > > \
> > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); \
> > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > > /* LL/SC */ \
> > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
> > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
> > > > unsigned int loop; \
> > > > u##sz ret; \
> > > > \
> > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); \
> > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > > /* LL/SC */ \
> > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
> > > > -----------------8<------------------------
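For reference, the hint added by the patch above, shown stand-alone
(illustration only, not part of the patch): PRFM PSTL1STRM prefetches the
line into L1 for a store, with a streaming/transient-use hint, so the
per-CPU atomic that follows finds its line already in the cache.

static inline void prefetch_line_for_store(void *ptr)
{
        /* Prefetch for store, into L1, streaming hint; same operand
         * constraints as in the patch above. */
        asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
}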
> > >
> > > I will give this a shot, thank you!
> >
> > Jackpot!!!
> >
> > This reduces the overhead to 8.427, which is significantly better than
> > the non-LSE value of 9.853. Still room for improvement, but much
> > better than the 100ns values.
> >
> > I presume that you will send this up the normal path, but in the meantime,
> > I will pull this in for further local testing, and thank you!
>
> After an educative discussion with the microarchitects, I think the
> hardware is behaving as intended; it just doesn't always fit the
> software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
> Linux as a STADD instruction (that's LDADD with XZR as destination; i.e.
> no need to return the value read from memory). This is typically
> executed as "far" or posted (unless it hits in the L1 cache) and
> intended for stat updates. At a quick grep, it matches the majority of
> the use-cases in Linux. Most other atomics (those with a return) are
> executed "near", so filling the cache lines (assuming default CPUECTLR
> configuration).
OK...
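For concreteness, a minimal sketch of the two shapes being contrasted
(hypothetical demo functions, not from the thread; exact code generation
depends on compiler and configuration):

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, demo_counter);

static void demo_stat_update(void)
{
        /* Result unused: compiles to STADD (LDADD with XZR as the
         * destination), which may be executed "far"/posted unless the
         * line already sits in L1. */
        this_cpu_inc(demo_counter);
}

static unsigned long demo_counting_update(void)
{
        /* Result consumed: compiles to LDADD with a real destination
         * register and is executed "near", filling the cache line. */
        return this_cpu_inc_return(demo_counter);
}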
> For the SRCU case, a STADD, especially together with the DMB after lock
> and before unlock, does slow things down when executed far. A
> microbenchmark doing this in a loop looks a lot worse than it would in
> practice (it saturates the buses down the path to memory).
In this srcu_read_lock_fast_updown() case, there was no DMB. But for
srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
(The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
> A quick test to check this theory, if that's the functions you were
> benchmarking (it generates LDADD instead):
Thank you for digging into this!
> ---------------------8<----------------------------------------
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index 42098e0fa0b7..5a6f3999883d 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
> struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
>
> if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> - this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> + this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> else
> atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); // Y, and implicit RCU reader.
> barrier(); /* Avoid leaking the critical section. */
> @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
> {
> barrier(); /* Avoid leaking the critical section. */
> if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> - this_cpu_inc(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
> + this_cpu_inc_return(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
> else
> atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks)); // Z, and implicit RCU reader.
> }
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 1ff94b76d91f..c025d9135689 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
> {
> struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
>
> - this_cpu_inc(scp->srcu_locks.counter);
> + this_cpu_inc_return(scp->srcu_locks.counter);
> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> return __srcu_ptr_to_ctr(ssp, scp);
> }
> @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
> {
> smp_mb(); /* C */ /* Avoid leaking the critical section. */
> - this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> + this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> }
> EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>
> ---------------------8<----------------------------------------
>
> To make things better for the non-fast variants above, we should add
> this_cpu_inc_return_acquire() etc. semantics (strangely,
> this_cpu_inc_return() doesn't have the full barrier semantics that
> atomic_inc_return() has).
>
> I'm not sure about adding the prefetch since most other uses of
> this_cpu_add() are meant for stat updates and there's not much point in
> bringing in a cache line. I think we could add release/acquire variants
> that generate LDADDA/L and maybe a slightly different API for the
> __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add
> full barrier semantics to the current _return() variants.
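To make the LDADDA idea above concrete, a purely hypothetical sketch (no
such helper exists in any tree; it ignores the preemption handling and
alternatives patching of the real this_cpu_* machinery and only
illustrates the instruction selection):

/* Add 'val' to the per-CPU slot at 'ptr' with acquire semantics and
 * return the new value; the caller is assumed to have resolved the
 * per-CPU address and dealt with preemption. */
static inline u64 percpu_add_return_acquire_sketch(u64 *ptr, u64 val)
{
        u64 old;

        asm volatile(
        "       ldadda  %[val], %[old], %[mem]\n"
        : [old] "=&r" (old), [mem] "+Q" (*ptr)
        : [val] "r" (val)
        : "memory");

        return old + val;
}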
But other architectures might well have this_cpu_inc_return() running
more slowly than this_cpu_inc(). So my thought would be to make a
this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and
this_cpu_inc() elsewhere.
I could imagine this_cpu_inc_local() or some such, but it is not clear
that the added API explosion is yet justified.
Or is there a better way?
Thanx, Paul
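The mapping described above could look roughly like this (hypothetical
sketch only; this_cpu_inc_srcu() does not exist in any tree):

#ifdef CONFIG_ARM64
/* Generate LDADD with a real destination so the counter line is pulled in. */
#define this_cpu_inc_srcu(pcp)  ((void)this_cpu_inc_return(pcp))
#else
/* Elsewhere, the plain increment is at least as cheap. */
#define this_cpu_inc_srcu(pcp)  this_cpu_inc(pcp)
#endif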