Overhead of arm64 LSE per-CPU atomics?

Tue Nov 4 13:25:51 PST 2025

On Tue, Nov 04, 2025 at 09:35:48PM +0100, Willy Tarreau wrote:
> On Tue, Nov 04, 2025 at 12:13:53PM -0800, Paul E. McKenney wrote:
> > > So it seems at first glance that LL/SC is generally slower but can be
> > > more consistent on modern machines, that LSE is stable on older machines
> > > and can be stable sometimes even on some modern machines.
> > 
> > I guess that I am glad that I am not alone?  ;-)
> > 
> > I am guessing that there is no reasonable way to check for whether a
> > given system has slow LSE, as would be needed to use ALTERNATIVE(),
> > but please let me know if I am mistaken.
> 
> I don't know either, and we've only tested additions (for which ldadd
> seems to do a better job than stadd for local values). I have no idea
> what happens with a CAS for example, that could be useful to set a max
> value for a metric and which can be quite inefficient using LL/SC,
> especially if the absolute value is stored in the same cache line as
> the max since every thread touching it would probably invalidate the
> update attempt. With a SWP instruction I don't see how it would be
> handled directly in SLC, since we need to know the previous value,
> hence load it into L1 (and hope nobody changes it between the load
> and the write attempt). But overall there seem to be a lot of
> unexplored possibilities here which I find quite interesting!
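
For concreteness, the CAS-based max update described above could look
roughly like the sketch below.  It uses C11 atomics rather than kernel
primitives, and update_max() is a hypothetical name chosen for
illustration, not something from this thread.  Whether the compiler
emits an LSE CAS instruction or an LL/SC loop for it depends on the
target flags (for example -march=armv8.1-a versus plain armv8-a), so
the same source can be used to compare the two.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: raise *max to val if val is larger.
 * Relaxed ordering is assumed to be enough for a statistics counter.
 */
static inline void update_max(_Atomic uint64_t *max, uint64_t val)
{
	uint64_t cur = atomic_load_explicit(max, memory_order_relaxed);

	/*
	 * Retry only while our value is still larger than the stored one.
	 * A failed compare-exchange writes the current contents of *max
	 * back into cur, so no separate reload is needed.
	 */
	while (cur < val &&
	       !atomic_compare_exchange_weak_explicit(max, &cur, val,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}
```

If the max shares a cache line with the value being updated, each
writer to that line can cause the compare-exchange to fail and retry,
which is the contention case raised above.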

I must admit that this is a fun one.  ;-)

							Thanx, Paul