Overhead of arm64 LSE per-CPU atomics?
Palmer Dabbelt
palmer at dabbelt.com
Thu Nov 6 10:23:10 PST 2025
On Thu, 06 Nov 2025 09:54:31 PST (-0800), Catalin Marinas wrote:
> On Thu, Nov 06, 2025 at 08:30:05AM -0800, Palmer Dabbelt wrote:
>> On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote:
>> > On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
>> > > I ran a bunch of cases with those:
>> > [...]
>> > > Which I'm interpreting to say the following:
>> > >
>> > > * LL/SC is pretty good for the common cases, but gets really bad under the
>> > > pathological cases. It still seems always slower than LDADD.
>> > > * STADD has latency that blocks other STADDs, but not other CPU-local work.
>> > > I'd bet there's a bunch of interactions with caches and memory ordering
>> > > here, but those would all just make STADD look worse so I'm just ignoring
>> > > them.
>> > > * LDADD is better than STADD even under pathologically highly contended
>> > > cases. I was actually kind of surprised about this one, I thought the far
>> > > atomics would be better there.
>> > > * The prefetches help STADD, but they don't seem to make it better than
>> > > LDADD in any case.
>> > > * The LDADD latency also happens concurrently with other CPU operations
>> > > like the STADD latency does. It has less latency to hide, so the latency
>> > > starts to go up with less extra work, but it's never worse than STADD.
>> > >
>> > > So I think at least on this system, LDADD is just always better.
>> >
> Thanks for this, very useful. I guess that's expected in light of what I
> learnt from the other Arm engineers in the past couple of days.
>>
>> OK, sorry if I misunderstood you earlier. From reading your posts I thought
>> there would be some mode in which STADD was better -- probably high
>> contention and enough extra work to hide the latency. So I was kind of
>> surprised to find these results.
>
> I think STADD is better for cases where you update some stat counters
> but you do a lot of work in between. In your microbenchmark, just lots
> of STADDs back to back with NOPs in between (rather than lots of other
> memory transactions) are likely to be slower. If these are real
> use-cases, at some point the hardware may evolve to behave differently
> (or more dynamically).
OK, that's kind of what I was trying to demonstrate when putting
together those new microbenchmark parameters. So I think at least I
understood what you were saying; now I just need to figure out what's
up...
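For anyone following along: the LDADD-vs-STADD distinction shows up
directly in how the C atomic builtins lower. A fetch-and-add whose
result is consumed has to become a value-returning atomic (LDADD under
LSE), while one whose result is discarded is eligible for the
store-only form (STADD). A minimal sketch, assuming arm64 with
-march=armv8.1-a or similar so the compiler emits LSE atomics (the
function names are just for illustration):

```c
#include <stdint.h>

/* Result is consumed: with LSE enabled this can lower to LDADD. */
uint64_t add_return(uint64_t *p, uint64_t v)
{
	return __atomic_fetch_add(p, v, __ATOMIC_RELAXED) + v;
}

/* Result is discarded: the compiler is free to pick STADD here. */
void add_noreturn(uint64_t *p, uint64_t v)
{
	(void)__atomic_fetch_add(p, v, __ATOMIC_RELAXED);
}
```

Whether the discarded-result form actually becomes STADD depends on
the compiler and the memory-order argument, so it's worth checking the
disassembly.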
FWIW: there's actually a bunch of memory traffic; the compiler is doing
something weird with that NOP loop and generating a bunch of
loads/stores/branches. I was kind of surprised, but I figured it's
actually better that way.
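In case anyone wants the pure-NOP behavior instead: wrapping the NOP in
a volatile asm keeps the compiler from eliding the loop, though it can
still spill the counter depending on optimization level. A sketch (not
from the benchmark; nop_delay() is a made-up name):

```c
/* Spin for `iters` iterations of a NOP.  The volatile asm pins each
 * iteration so the loop isn't optimized away entirely. */
static void nop_delay(unsigned long iters)
{
	for (unsigned long i = 0; i < iters; i++)
		__asm__ volatile("nop");
}
```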
Also, I found there's a bug in the microbenchmarks: "tmp" is a global,
so the LDADD code generates
00000000004102b0 <__percpu_add_case_64_ldadd>:
4102b0: 90000189 adrp x9, 440000 <memcpy at GLIBC_2.17>
4102b4: f8210008 ldadd x1, x8, [x0]
4102b8: f9005528 str x8, [x9, #168]
4102bc: d65f03c0 ret
as opposed to the STADD code, which generates
00000000004102a8 <__percpu_add_case_64_lse>:
4102a8: f821001f stadd x1, [x0]
4102ac: d65f03c0 ret
It doesn't seem to change my results any, but figured I'd say something
in case anyone else tries to run this stuff (there's a fix up, too).
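The shape of the fix is basically to keep the fetched value in a local
(or return it) so it stays live without forcing the extra str to the
global after the ldadd. A sketch, assuming a plain __atomic_fetch_add
stand-in for the benchmark helper (the name below mirrors the
disassembly but the body is illustrative):

```c
#include <stdint.h>

/* Returning the old value keeps the LDADD's result live (so the
 * compiler can't downgrade it to STADD) without the global-store
 * side effect the buggy version had. */
uint64_t percpu_add_case_64_ldadd(uint64_t *ptr, uint64_t val)
{
	uint64_t tmp = __atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
	return tmp;
}
```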
> BTW, I've been pointed by Ola Liljedahl @ Arm at this collection of
> routines: https://github.com/ARM-software/progress64/tree/master.
> Building it with ATOMICS=yes makes the compiler generate LSE atomics for
> intrinsics like __atomic_fetch_add(). It won't generate STADD because of
> some aspects of the C consistency models (DMB LD wouldn't guarantee
> ordering with a prior STADD).
Awesome, thanks. I'll go take a look -- I'm trying to understand enough
of what's going on to figure out what we should do here, but that's
mostly outside of kernel space now so I think it's just going to be a
discussion for somewhere else...
More information about the linux-arm-kernel
mailing list