Overhead of arm64 LSE per-CPU atomics?

Wed Nov 5 05:42:31 PST 2025

On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > But need to add the prefetch in per-cpu implementation as you've
> > noticed above (didn't add it since no prefetch for LL/SC
> > implementation there, maybe a missing?)
> 
> Maybe no-one stressed these to notice any difference between LL/SC and
> LSE.

Huh? I can say for certain that LL/SC is a no-go beyond 16 cores: I have
seen catastrophic performance there with haproxy, while with LSE it
continues to scale almost linearly up to at least 64. But that does not
mean we shouldn't try to recover 90% of the atomic overhead in the
uncontended case if that is within reach at a reasonable cost!

I'm definitely adding to my todo list to experiment more with this on
various CPUs now ;-)

Willy