Overhead of arm64 LSE per-CPU atomics?

Thu Nov 6 06:00:59 PST 2025

On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
> I ran a bunch of cases with those:
[...]
> Which I'm interpreting to say the following:
> 
> * LL/SC is pretty good for the common cases, but gets really bad under the
> pathological cases.  It still seems always slower than LDADD.
> * STADD has latency that blocks other STADDs, but not other CPU-local work.
> I'd bet there's a bunch of interactions with caches and memory ordering
> here, but those would all just make STADD look worse so I'm just ignoring
> them.
> * LDADD is better than STADD even under pathologically highly contended
> cases.  I was actually kind of surprised about this one, I thought the far
> atomics would be better there.
> * The prefetches help STADD, but they don't seem to make it better than
> LDADD in any case.
> * The LDADD latency also happens concurrently with other CPU operations
> like the STADD latency does.  It has less latency to hide, so the latency
> starts to go up with less extra work, but it's never worse than STADD.
> 
> So I think at least on this system, LDADD is just always better.

Thanks for this, very useful. I guess that's expected in the light of what
I've learnt from the other Arm engineers in the past couple of days.

-- 
Catalin