Overhead of arm64 LSE per-CPU atomics?

Wed Nov 5 06:49:39 PST 2025

On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > But need to add the prefetch in per-cpu implementation as you've
> > > noticed above (didn't add it since no prefetch for LL/SC
> > > implementation there, maybe a missing?)
> > 
> > Maybe no-one stressed these to notice any difference between LL/SC and
> > LSE.
> 
> Huh ? I can say for certain that LL/SC is a no-go beyond 16 cores, for
> having faced catastrophic performance there on haproxy, while with LSE
> it continues to scale almost linearly at least till 64.

I was referring only to the this_cpu_add() etc. functions, which probably
no-one had stressed until Paul started using them. There definitely have
been lots of benchmarks on the scalability of LL/SC. That's one of the
reasons Arm added the LSE atomics years ago.

> But that does
> not mean that if some possibilities are within reach to recover 90% of
> the atomic overhead in uncontended case we shouldn't try to grab it at
> a reasonable cost!

I agree. Even for these cases, I don't think the solution is LL/SC but
rather better use of LSE (and better understanding of the hardware
behaviour; feedback here should go both ways).

> I'm definitely adding in my todo list to experiment more on this on
> various CPUs now ;-)

Thanks for the tests so far, very insightful. What I think is still
worth assessing is how PRFM+STADD compares to LDADD (without PRFM) in
Breno's microbenchmarks. I suspect LDADD is still better.
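
For illustration, the two variants in that comparison could be sketched
in userspace C roughly as below. This is an untested sketch and not the
kernel's this_cpu_add() code: the helper names add_prfm_stadd() and
add_ldadd() are hypothetical, the PSTL1STRM prefetch hint is borrowed
from the existing LL/SC atomics as an assumption, and the inline asm is
only used when the compiler targets arm64 with LSE enabled (e.g.
-march=armv8.1-a); otherwise a plain relaxed fetch-add stands in.

```c
#include <stdint.h>

/* Hypothetical helper: prefetch the line for store, then STADD.
 * STADD discards the old value, so it may be executed "far". */
static inline void add_prfm_stadd(uint64_t *p, uint64_t v)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	asm volatile(
	"	prfm	pstl1strm, %[ptr]\n"
	"	stadd	%[val], %[ptr]\n"
	: [ptr] "+Q" (*p)
	: [val] "r" (v)
	: "memory");
#else
	/* Portable stand-in so the sketch builds on other targets. */
	(void)__atomic_fetch_add(p, v, __ATOMIC_RELAXED);
#endif
}

/* Hypothetical helper: LDADD with a real destination register and
 * no prefetch. Returns the value observed before the addition. */
static inline uint64_t add_ldadd(uint64_t *p, uint64_t v)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	uint64_t old;

	asm volatile(
	"	ldadd	%[val], %[old], %[ptr]\n"
	: [old] "=&r" (old), [ptr] "+Q" (*p)
	: [val] "r" (v)
	: "memory");
	return old;
#else
	return __atomic_fetch_add(p, v, __ATOMIC_RELAXED);
#endif
}
```

A microbenchmark would call each helper in a tight loop on a per-thread
counter and compare the time per operation; the sketch says nothing
about which one wins on a given CPU.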

FWIW, Neoverse-N1 has an erratum affecting the far atomics and they are
all forced near, so this explains the consistent results you got with
STADD on this CPU. On other CPUs, STADD would likely be executed far
unless it hits in the L1 cache.

-- 
Catalin