Overhead of arm64 LSE per-CPU atomics?

Tue Nov 4 12:35:48 PST 2025

On Tue, Nov 04, 2025 at 12:13:53PM -0800, Paul E. McKenney wrote:
> > So it seems at first glance that LL/SC is generally slower but can be
> > more consistent on modern machines, that LSE is stable on older machines
> > and can be stable sometimes even on some modern machines.
> 
> I guess that I am glad that I am not alone?  ;-)
> 
> I am guessing that there is no reasonable way to check for whether a
> given system has slow LSE, as would be needed to use ALTERNATIVE(),
> but please let me know if I am mistaken.

I don't know either, and we've only tested additions (for which ldadd
seems to do a better job than stadd for local values). I have no idea
what happens with a CAS, for example, which could be useful to set a max
value for a metric and which can be quite inefficient using LL/SC,
especially if the absolute value is stored in the same cache line as
the max, since every thread touching that line would probably invalidate
the update attempt. With a SWP instruction I don't see how it would be
handled directly in SLC, since we need to know the previous value,
hence load it into L1 (and hope nobody changes it between the load
and the write attempt). But overall there seem to be a lot of
unexplored possibilities here, which I find quite interesting!
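
For illustration only, the "set a max value for a metric" case could be
sketched in portable C11 roughly as below. The helper name
metric_update_max() is made up for this example, and this is untested
as far as performance goes; it only shows the shape of the CAS retry
loop being discussed.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from any existing code base): raise *max to
 * val if val is larger, using a compare-and-swap retry loop.
 */
static inline void metric_update_max(_Atomic uint64_t *max, uint64_t val)
{
	uint64_t cur = atomic_load_explicit(max, memory_order_relaxed);

	/*
	 * On failure, the compare-exchange reloads the current value into
	 * cur, so the loop stops as soon as another thread has stored a
	 * value that is >= val.
	 */
	while (cur < val &&
	       !atomic_compare_exchange_weak_explicit(max, &cur, val,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}
```

Depending on the compiler and flags, the compare-exchange would
typically become either an LDXR/STXR loop or a single LSE CAS
instruction (for example when building for armv8.1-a or later), which
is where the two approaches could be compared.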

Willy