Overhead of arm64 LSE per-CPU atomics?

Thu Nov 6 09:54:31 PST 2025

On Thu, Nov 06, 2025 at 08:30:05AM -0800, Palmer Dabbelt wrote:
> On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote:
> > On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
> > > I ran a bunch of cases with those:
> > [...]
> > > Which I'm interpreting to say the following:
> > > 
> > > * LL/SC is pretty good for the common cases, but gets really bad under the
> > > pathological cases.  It still seems always slower than LDADD.
> > > * STADD has latency that blocks other STADDs, but not other CPU-local work.
> > > I'd bet there's a bunch of interactions with caches and memory ordering
> > > here, but those would all just make STADD look worse so I'm just ignoring
> > > them.
> > > * LDADD is better than STADD even under pathologically highly contended
> > > cases.  I was actually kind of surprised about this one, I thought the far
> > > atomics would be better there.
> > > * The prefetches help STADD, but they don't seem to make it better than
> > > LDADD in any case.
> > > * The LDADD latency also happens concurrently with other CPU operations
> > > like the STADD latency does.  It has less latency to hide, so the latency
> > > starts to go up with less extra work, but it's never worse than STADD.
> > > 
> > > So I think at least on this system, LDADD is just always better.
> > 
> > Thanks for this, very useful. I guess that's expected in light of what I
> > learnt from the other Arm engineers in the past couple of days.
> 
> OK, sorry if I misunderstood you earlier.  From reading your posts I thought
> there would be some mode in which STADD was better -- probably high
> contention and enough extra work to hide the latency.  So I was kind of
> surprised to find these results.

I think STADD is better for cases where you update some stat counters
but you do a lot of work in between. In your microbenchmark, just lots
of STADDs back to back with NOPs in between (rather than lots of other
memory transactions) are likely to be slower. If these are real
use-cases, at some point the hardware may evolve to behave differently
(or more dynamically).
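
To make the shape of that comparison concrete, here is a minimal sketch of
the "atomic add, then N NOPs" loop. It is not the harness used earlier in
the thread: bench_loop() and filler_nops() are hypothetical names, timing
is left out, and it assumes GCC or Clang (for the __atomic builtins and
inline asm) on a target whose assembler accepts a plain "nop".

```c
#include <stdint.h>

/*
 * Hypothetical filler standing in for the "work in between".
 * NOPs keep the core busy without issuing memory transactions,
 * which is exactly what makes this unlike a real workload.
 */
static inline void filler_nops(unsigned int n)
{
	while (n--)
		__asm__ volatile("nop");
}

/*
 * Hypothetical benchmark body: one relaxed atomic increment per
 * iteration, followed by 'nops' NOPs. Returns the final counter
 * value so a caller can check that no update was lost.
 */
static uint64_t bench_loop(uint64_t *ctr, uint64_t iters, unsigned int nops)
{
	for (uint64_t i = 0; i < iters; i++) {
		(void)__atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
		filler_nops(nops);
	}
	return __atomic_load_n(ctr, __ATOMIC_RELAXED);
}
```

Replacing filler_nops() with loads and stores to unrelated cache lines
would be one way to approximate the "lots of other memory transactions"
case instead.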

BTW, Ola Liljedahl @ Arm pointed me at this collection of routines:
https://github.com/ARM-software/progress64/tree/master.
Building it with ATOMICS=yes makes the compiler generate LSE atomics for
intrinsics like __atomic_fetch_add(). It won't generate STADD because of
some aspects of the C consistency models (DMB LD wouldn't guarantee
ordering with a prior STADD).
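
As a rough illustration of the intrinsic in question, the sketch below
shows the two usual ways a counter gets bumped. The function names are
hypothetical. It assumes GCC or Clang; which instruction actually comes
out depends on the compiler version and flags (LSE needs something like
-march=armv8.1-a or later), so check the disassembly rather than taking
the comments as a guarantee.

```c
#include <stdint.h>

/*
 * Result discarded: the "fire and forget" statistics update. This is
 * the pattern where a store-type atomic (STADD) would be the natural
 * fit, but per the note above the compiler is not expected to emit
 * one for this builtin.
 */
static inline void stat_inc(uint64_t *ctr)
{
	(void)__atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
}

/*
 * Result used: the old value has to come back to the CPU, so a
 * load-type atomic (LDADD with LSE, or an LL/SC loop without it)
 * is required regardless.
 */
static inline uint64_t stat_inc_return_old(uint64_t *ctr)
{
	return __atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
}
```

Comparing the output of "objdump -d" for the two functions is a quick way
to see what a given toolchain does with the discarded-result case.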

-- 
Catalin