Overhead of arm64 LSE per-CPU atomics?

Paul E. McKenney paulmck at kernel.org
Wed Nov 5 08:25:51 PST 2025


On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
> On Tue, Nov 04, 2025 at 12:10:36PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> > > > For the SRCU case, executing STADD far does slow things down,
> > > > especially together with the DMB after lock and before unlock. A
> > > > microbenchmark doing this in a loop looks a lot worse than it would
> > > > in practice (it saturates the buses down the path to memory).
> > > 
> > > In this srcu_read_lock_fast_updown() case, there was no DMB.  But for
> > > srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
> > > (srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
> > > 
> > > > A quick test to check this theory, if those are the functions you were
> > > > benchmarking (it generates LDADD instead):
> > > 
> > > Thank you for digging into this!
> > 
> > And this_cpu_inc_return() does speed things up on my hardware to about
> > the same extent as did the prefetch instruction, so thank you again.
> > However, it gets me more than a 4x slowdown on x86, so I cannot make
> > this change in common code.
> 
> I'm definitely not suggesting that we use the 'return' variants in the
> generic code. More likely, we would change the arm64 code to use them
> for the per-CPU atomics.

Whew!!!  ;-)

> > So, my thought is to push an arm64-only this_cpu_inc_return() into SRCU
> > via something like this_cpu_inc_srcu(), not for the upcoming merge window
> > but for the one after that, sticking with my current interrupt-disabling
> > non-atomic approach in the meantime (which gets me most of the benefit).
> > Alternatively, would it work for me to put that cache-prefetch instruction
> > into SRCU for arm64?  My guess is "absolutely not!", but I figured that
> > I should ask.
> 
> Given that this_cpu_*() are meant for the local CPU, there's less risk
> of cache line bouncing between CPUs, so I'm happy to change them to use
> either PRFM or LDADD (I think I prefer the latter). This would not
> be a generic change for the other atomics, only the per-CPU ones.

I have easy access to only the one type of ARM system, and of course
the choice must be driven by a wide range of systems.  But yes, it
would be much better if we could just use this_cpu_inc().  I will use the
non-atomics protected by interrupt disabling in the meantime, but I look
forward to being able to switch back.

> > But if both of these approaches prove problematic, I might need some
> > way to distinguish systems having slow LSE atomics from those that do not.
> 
> It's not that systems have slow or fast atomics; it's more that they are
> slow or fast for specific use-cases. Their default behaviour may differ,
> and at least in the Arm Ltd cases this is configurable. An STADD executed
> in the L1 cache (near) may be better for your case and for some
> microbenchmarks, but not necessarily for others. I've heard of results
> from database use-cases where STADD executed far is better than LDADD
> executed near when the location is shared between multiple CPUs. In
> these cases even a PRFM can be problematic, as it tends to bring in a
> unique copy of the cacheline, invalidating the other copies (well,
> again, this is microarchitecture specific).

Fair point, and I do need to be careful not to read too much into the
results from my one type of system.  Plus, to your point elsewhere in
this thread, making the hardware better would be quite welcome as well.
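
For anyone following along, the near/far distinction above maps onto the
two LSE forms that the arm64 per-CPU ops generate, at least as I
understand it (a sketch, not exact codegen, which depends on compiler
and configuration):

	this_cpu_inc(ctr);		/* non-return op -> STADD; may be
					 * executed far, out in the cache
					 * hierarchy or interconnect. */
	val = this_cpu_inc_return(ctr);	/* return op -> LDADD; executed
					 * near, in the L1, on the recent
					 * Arm Ltd cores discussed here. */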

> For the Arm Ltd implementations, I think the behaviour of most of the
> (recent) CPUs is that the load atomics, CAS, and SWP are executed near,
> while the store atomics are executed far (subject to configuration,
> errata, and the interconnect). Arm should probably provide some guidance
> here so that other implementers and software people know how and when to
> use them.

Or make the hardware figure out what to do automatically for each use
case as it executes.  Perhaps a bit utopian, but it is nevertheless a
good direction to aim for.

							Thanx, Paul

> -- 
> Catalin


