Overhead of arm64 LSE per-CPU atomics?
Palmer Dabbelt
palmer at dabbelt.com
Thu Nov 6 08:30:05 PST 2025
On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
>> I ran a bunch of cases with those:
> [...]
>> Which I'm interpreting to say the following:
>>
>> * LL/SC is pretty good for the common cases, but gets really bad under the
>> pathological cases. It still always seems slower than LDADD.
>> * STADD has latency that blocks other STADDs, but not other CPU-local work.
>> I'd bet there's a bunch of interactions with caches and memory ordering
>> here, but those would all just make STADD look worse so I'm just ignoring
>> them.
>> * LDADD is better than STADD even under pathologically highly contended
>> cases. I was actually kind of surprised about this one, I thought the far
>> atomics would be better there.
>> * The prefetches help STADD, but they don't seem to make it better than
>> LDADD in any case.
>> * The LDADD latency also happens concurrently with other CPU operations
>> like the STADD latency does. It has less latency to hide, so the latency
>> starts to go up with less extra work, but it's never worse than STADD.
>>
>> So I think at least on this system, LDADD is just always better.
>
> Thanks for this, very useful. I guess that's expected in light of what I
> learnt from the other Arm engineers in the past couple of days.
OK, sorry if I misunderstood you earlier. From reading your posts I
thought there would be some mode in which STADD was better -- probably
high contention and enough extra work to hide the latency. So I was
kind of surprised to find these results.
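
For anyone following along without the benchmark source, here is a minimal
sketch of the two flavours of atomic add being compared. It is not the test
code from this thread: the counter and both function names are made up for
illustration, and it uses plain C11 atomics rather than the kernel's per-CPU
helpers. The assumption is that, when built for a target with LSE (for example
-march=armv8.1-a), a compiler will typically emit LDADD when the old value is
used and may emit STADD when it is discarded. That choice is up to the
compiler, so check the disassembly before drawing conclusions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared counter, standing in for a per-CPU variable. */
static _Atomic uint64_t counter;

/*
 * The old value is returned to the caller, so the atomic needs a
 * destination register. With LSE enabled this is typically LDADD.
 */
static uint64_t add_return_old(uint64_t n)
{
	return atomic_fetch_add_explicit(&counter, n, memory_order_relaxed);
}

/*
 * The old value is discarded. With LSE enabled a compiler may emit
 * STADD here, which is the no-return form of the same operation.
 */
static void add_no_return(uint64_t n)
{
	(void)atomic_fetch_add_explicit(&counter, n, memory_order_relaxed);
}
```

Both functions leave the same value in memory; the only difference is whether
the caller gets the previous value back, which is the distinction the
measurements above are about.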