Overhead of arm64 LSE per-CPU atomics?
Paul E. McKenney
paulmck at kernel.org
Tue Nov 4 12:10:36 PST 2025
On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> > On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > > index 9abcc8ef3087..e381034324e1 100644
> > > > > --- a/arch/arm64/include/asm/percpu.h
> > > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
> > > > > unsigned int loop; \
> > > > > u##sz tmp; \
> > > > > \
> > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > > > /* LL/SC */ \
> > > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
> > > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
> > > > > unsigned int loop; \
> > > > > u##sz ret; \
> > > > > \
> > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
> > > > > /* LL/SC */ \
> > > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
> > > > > -----------------8<------------------------
> > > >
> > > > I will give this a shot, thank you!
> > >
> > > Jackpot!!!
> > >
> > > This reduces the overhead to 8.427ns, which is significantly better
> > > than the non-LSE value of 9.853ns. Still room for improvement, but
> > > much better than the 100ns values.
> > >
> > > I presume that you will send this up the normal path, but in the meantime,
> > > I will pull this in for further local testing, and thank you!
> >
> > After an educative discussion with the microarchitects, I think the
> > hardware is behaving as intended; it just doesn't always fit the
> > software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
> > Linux as a STADD instruction (that's LDADD with XZR as destination; i.e.
> > no need to return the value read from memory). This is typically
> > executed as "far" or posted (unless it hits in the L1 cache) and
> > intended for stat updates. At a quick grep, it matches the majority of
> > the use-cases in Linux. Most other atomics (those with a return) are
> > executed "near", so filling the cache lines (assuming default CPUECTLR
> > configuration).
>
> OK...
>
> > For the SRCU case, executing STADD far does slow things down, especially
> > together with the DMB after lock and before unlock. A microbenchmark
> > doing this in a loop looks a lot worse than it would in practice
> > (it saturates the buses down the path to memory).
>
> In this srcu_read_lock_fast_updown() case, there was no DMB. But for
> srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
> (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
>
> > A quick test to check this theory, if those are the functions you were
> > benchmarking (it generates LDADD instead):
>
> Thank you for digging into this!

And this_cpu_inc_return() does speed things up on my hardware to about
the same extent as did the prefetch instruction, so thank you again.
However, it gets me more than a 4x slowdown on x86, so I cannot make
this change in common code.
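
If I am reading the near/far distinction correctly, the difference at
the instruction level is roughly the following.  This is only an
illustrative sketch (the helper names are made up, and this is not the
kernel's actual percpu code), and it assumes an LSE-capable toolchain,
for example -march=armv8.1-a:

static inline void lse_add_posted(unsigned long *p, unsigned long v)
{
	/*
	 * No destination register, so this assembles to STADD (the
	 * alias for LDADD with XZR), which the CPU may execute
	 * "far"/posted unless the line is already in L1.
	 */
	asm volatile("stadd	%[v], %[p]"
		     : [p] "+Q" (*p)
		     : [v] "r" (v));
}

static inline unsigned long lse_add_return_near(unsigned long *p,
						unsigned long v)
{
	unsigned long old;

	/*
	 * Asking for the old value forces LDADD proper, which is
	 * executed "near" and pulls the line into the L1 cache
	 * (given the default CPUECTLR configuration).
	 */
	asm volatile("ldadd	%[v], %[old], %[p]"
		     : [p] "+Q" (*p), [old] "=r" (old)
		     : [v] "r" (v));
	return old + v;
}

I have not checked the generated code on x86, but my guess is that the
4x hit there comes from this_cpu_inc_return() ending up as an xadd on
%gs rather than a plain increment.
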
So, my thought is to push an arm64-only this_cpu_inc_return() into SRCU
via something like this_cpu_inc_srcu(), though not for the upcoming merge
window but rather the one after that, sticking with my current
interrupt-disabling non-atomic approach in the meantime (which gets me
most of the benefit).
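
For concreteness, what I have in mind is no more than the following
untested sketch, with __srcu_read_lock_fast() and friends switched
over to it:

#ifdef CONFIG_ARM64
/* Force the near (LDADD) form so the counter's cache line comes local. */
#define this_cpu_inc_srcu(pcp)	((void)this_cpu_inc_return(pcp))
#else
/* Everyone else keeps the cheaper no-return increment. */
#define this_cpu_inc_srcu(pcp)	this_cpu_inc(pcp)
#endif
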
Alternatively, would it work for me to put that cache-prefetch instruction
into SRCU for arm64? My guess is "absolutely not!", but I figured that
I should ask.
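
For the record, what I am asking about is roughly the following
untested sketch (srcu_prefetch_ctr() is a made-up name), which
__srcu_read_lock_fast() and friends would call on
raw_cpu_ptr(&scp->srcu_locks) just before the increment:

/* Sketch of an SRCU-private prefetch, a no-op except on arm64. */
static inline void srcu_prefetch_ctr(void *ptr)
{
#ifdef CONFIG_ARM64
	asm volatile("prfm	pstl1strm, %a0\n" : : "p" (ptr));
#endif
}
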
But if both of these approaches prove problematic, I might need some
way to distinguish between systems having slow LSE and those that do not.
Or I can stick with disabling interrupts across non-atomic updates.

Thoughts?

							Thanx, Paul
> > ---------------------8<----------------------------------------
> > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> > index 42098e0fa0b7..5a6f3999883d 100644
> > --- a/include/linux/srcutree.h
> > +++ b/include/linux/srcutree.h
> > @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
> > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
> >
> > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> > - this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> > + this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> > else
> > atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); // Y, and implicit RCU reader.
> > barrier(); /* Avoid leaking the critical section. */
> > @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
> > {
> > barrier(); /* Avoid leaking the critical section. */
> > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> > - this_cpu_inc(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
> > + this_cpu_inc_return(scp->srcu_unlocks.counter); // Z, and implicit RCU reader.
> > else
> > atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks)); // Z, and implicit RCU reader.
> > }
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 1ff94b76d91f..c025d9135689 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
> > {
> > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
> >
> > - this_cpu_inc(scp->srcu_locks.counter);
> > + this_cpu_inc_return(scp->srcu_locks.counter);
> > smp_mb(); /* B */ /* Avoid leaking the critical section. */
> > return __srcu_ptr_to_ctr(ssp, scp);
> > }
> > @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> > void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
> > {
> > smp_mb(); /* C */ /* Avoid leaking the critical section. */
> > - this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> > + this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> > }
> > EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >
> > ---------------------8<----------------------------------------
> >
> > To make things better for the non-fast variants above, we should add
> > this_cpu_inc_return_acquire() etc. semantics (strangely,
> > this_cpu_inc_return() doesn't have the full barrier semantics that
> > atomic_inc_return() has).
> >
> > I'm not sure about adding the prefetch since most other uses of
> > this_cpu_add() are meant for stat updates and there's not much point in
> > bringing in a cache line. I think we could add release/acquire variants
> > that generate LDADDA/L and maybe a slightly different API for the
> > __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add
> > full barrier semantics to the current _return() variants.
>
> But other architectures might well have this_cpu_inc_return() running
> more slowly than this_cpu_inc(). So my thought would be to make a
> this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and
> this_cpu_inc() elsewhere.
>
> I could imagine this_cpu_inc_local() or some such, but it is not clear
> that the added API explosion is yet justified.
>
> Or is there a better way?
>
> Thanx, Paul