Overhead of arm64 LSE per-CPU atomics?

Willy Tarreau w at 1wt.eu
Wed Nov 5 23:44:39 PST 2025


On Wed, Nov 05, 2025 at 02:49:39PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> > On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > > But we need to add the prefetch in the per-cpu implementation as
> > > > you noticed above (I didn't add it since there's no prefetch in
> > > > the LL/SC implementation there; maybe an omission?)
> > > 
> > > Maybe no one has stressed these enough to notice any difference
> > > between LL/SC and LSE.
> > 
> > Huh? I can say for certain that LL/SC is a no-go beyond 16 cores,
> > having faced catastrophic performance with it in haproxy, while LSE
> > continues to scale almost linearly at least up to 64 cores.
> 
> I was referring only to the this_cpu_add() etc. functions (until Paul
> started using them).

Ah OK thanks for clarifying!
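
Just for the archives, here's what I understand the discussed change
would look like, as my own simplified sketch (not the kernel's actual
percpu code, and with the per-cpu pointer resolution omitted):

    #include <stdint.h>

    /* Hypothetical sketch of an LSE this_cpu_add() fast path with an
     * explicit prefetch-for-store hint added, as discussed above.
     */
    static inline void sketch_this_cpu_add_u64(uint64_t *ptr, uint64_t val)
    {
            asm volatile(
            "       prfm    pstl1strm, %[v]\n"  /* prefetch line for store */
            "       stadd   %[i], %[v]\n"       /* relaxed LSE add, no result */
            : [v] "+Q" (*ptr)
            : [i] "r" (val)
            : "memory");
    }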

> There definitely have been lots of benchmarks on
> the scalability of LL/SC. That's one of the reasons Arm added the LSE
> atomics years ago.

Yes, that's what I thought, which is why your sentence shocked me in
the first place :-)

> > But that does not mean that if some possibilities are within reach
> > to recover 90% of the atomic overhead in the uncontended case, we
> > shouldn't try to grab them at a reasonable cost!
> 
> I agree. Even for these cases, I don't think the solution is LL/SC but
> rather better use of LSE (and better understanding of the hardware
> behaviour; feedback here should go both ways).

I totally agree. I'm happy to have discovered the near vs. far
distinction here, which I was not aware of; it will make me think
differently in the future when designing around shared data.

> > I'm definitely adding to my todo list to experiment more with this
> > on various CPUs now ;-)
> 
> Thanks for the tests so far, very insightful. I think what would still
> be good to assess is how PRFM+STADD compares to LDADD (without PRFM)
> in Breno's microbenchmarks. I suspect LDADD is still better.

Yep, as confirmed by Breno's last test after your message.
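
For reference, the returning form would be roughly this (again my own
sketch, not Breno's actual benchmark code):

    #include <stdint.h>

    /* Sketch of the LDADD-based variant: it returns the previous value,
     * which presumably encourages the core to execute it near (pulling
     * the line into L1) instead of far at the interconnect.
     */
    static inline uint64_t sketch_ldadd(uint64_t *ptr, uint64_t val)
    {
            uint64_t old;

            asm volatile(
            "       ldadd   %[i], %[old], %[v]\n"
            : [v] "+Q" (*ptr), [old] "=r" (old)
            : [i] "r" (val)
            : "memory");
            return old;
    }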

> FWIW, Neoverse-N1 has an erratum affecting the far atomics, so they
> are all forced near; this explains the consistent results you got with
> STADD on this CPU. On other CPUs, STADD would likely be executed far
> unless it hits in the L1 cache.

Ah, thanks for letting me know! This indeed explains the difference.
Do you have pointers to any docs suggesting which instructions to use
when you prefer a near or a far operation, like here with stadd vs
ldadd? Also, does this mean that with LSE a pure store will always be
executed far unless the line was prefetched? Or should we trick stores
using stadd mem,0 / ldadd mem,0 to hint at a near vs far store, for
example? I'm also wondering about CAS: is there a way to perform the
usual load+CAS sequence exclusively with far operations, to avoid
cache lines bouncing in contended environments? There are cases where
a constant 50-60ns per CAS would be awesome. Or maybe even a CAS that
remains far on failure but triggers a prefetch of the line on success,
for the typical CAS(ptr, NULL, mine) used to try to own a shared
resource.
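
To illustrate that last point, the ownership pattern I mean is roughly
this (sketched with GCC builtins; the helper name is mine):

    #include <stdbool.h>
    #include <stddef.h>

    /* Try to take ownership of a shared resource with a single CAS.
     * Built with -march=armv8.1-a (or -moutline-atomics) this compiles
     * to an LSE CAS; whether it executes near or far is up to the CPU,
     * hence the question above.
     */
    static inline bool try_own(void **ptr, void *mine)
    {
            void *expected = NULL;

            return __atomic_compare_exchange_n(ptr, &expected, mine,
                                               false /* strong */,
                                               __ATOMIC_ACQUIRE,
                                               __ATOMIC_RELAXED);
    }

On success the caller will touch the line anyway, which is why a
prefetch-on-success behaviour would be ideal, while on failure the
line could stay far and avoid the bouncing.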

Thanks,
Willy


