Overhead of arm64 LSE per-CPU atomics?
Catalin Marinas
catalin.marinas at arm.com
Wed Nov 5 07:34:21 PST 2025
On Tue, Nov 04, 2025 at 12:10:36PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> > > For the SRCU case, STADD especially together with the DMB after lock and
> > > before unlock, executing it far does slow things down. A microbenchmark
> > > doing this in a loop is a lot worse than it would appear in practice
> > > (saturating buses down the path to memory).
> >
> > In this srcu_read_lock_fast_updown() case, there was no DMB. But for
> > srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
> > (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
> >
> > > A quick test to check this theory, if those are the functions you were
> > > benchmarking (it generates LDADD instead):
> >
> > Thank you for digging into this!
>
> And this_cpu_inc_return() does speed things up on my hardware to about
> the same extent as did the prefetch instruction, so thank you again.
> However, it gets me more than a 4x slowdown on x86, so I cannot make
> this change in common code.
I'm definitely not suggesting that we use the 'return' variants in the
generic code. A more likely change is to make the arm64 code use them
for the per-CPU atomics.
> So, my thought is to push arm64-only this_cpu_inc_return() into SRCU via
> something like this_cpu_inc_srcu(), though not for the upcoming merge
> window but the one after that, sticking with my current
> interrupt-disabling non-atomic approach in the meantime (which gets me
> most of the benefit).
> Alternatively, would it work for me to put that cache-prefetch instruction
> into SRCU for arm64? My guess is "absolutely not!", but I figured that
> I should ask.
Given that this_cpu_*() are meant for the local CPU, there's less risk
of the cache line bouncing between CPUs, so I'm happy to change them to
use either PRFM or LDADD (I think I prefer the latter). This would not
be a generic change for the other atomics, only the per-CPU ones.
> But if both of these approaches prove problematic, I might need some
> way to distinguish between systems having slow LSE and those that do not.
It's not that systems have slow or fast atomics; rather, they are slow
or fast for specific use-cases. Their default behaviour may differ, and
at least in the Arm Ltd cases it is configurable. An STADD executed in
the L1 cache (near) may be better for your case and some
microbenchmarks, but not necessarily for others. I've heard of results
from database use-cases where an STADD executed far is better than an
LDADD executed near when the location is shared between multiple CPUs.
In those cases even a PRFM can be problematic, as it tends to bring in a
unique copy of the cache line, invalidating the other copies (well,
again, microarchitecture-specific).
For the Arm Ltd implementations, I think the behaviour on most of the
(recent) CPUs is that the load atomics, CAS and SWP are executed near,
while the store atomics are executed far (subject to configuration,
errata and the interconnect). Arm should probably provide some guidance
here so that other implementers and software people know how and when to
use them.
--
Catalin