Overhead of arm64 LSE per-CPU atomics?

Wed Nov 5 09:40:32 PST 2025

On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
> > > Given that this_cpu_*() are meant for the local CPU, there's less risk
> > > of cache line bouncing between CPUs, so I'm happy to change them to
> > > either use PRFM or LDADD (I think I prefer the latter). This would not
> > > be a generic change for the other atomics, only the per-CPU ones.
> > 
> > I have easy access to only the one type of ARM system, and of course
> > the choice must be driven by a wide range of systems.  But yes, it
> > would be much better if we can just use this_cpu_inc().  I will use the
> > non-atomics protected by interrupt disabling in the meantime, but look
> > forward to being able to switch back.
> 
> BTW, did you find a problem with this_cpu_inc() in normal use with SRCU
> or just in a microbenchmark hammering them? From what I understand from
> the hardware folk, doing STADD in a loop saturates some queues in the
> interconnect and slows down eventually. In normal use, it's just a
> posted operation not affecting the subsequent instructions (or at least
> that's the theory).

Only in a microbenchmark, and Breno did not find any issues in larger
benchmarks, so good to hear!

Now, some non-arm64 systems deal with that sort of microbenchmark just
fine, but perhaps I owe everyone an apology for the fire drill.

But let me put it this way...  Would you ack an SRCU patch that resulted
in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on
other systems?

							Thanx, Paul