Overhead of arm64 LSE per-CPU atomics?

Tue Nov 4 12:13:53 PST 2025

On Tue, Nov 04, 2025 at 07:08:19PM +0100, Willy Tarreau wrote:
> Hello Breno,
> 
> On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> > I found that the LSE case (__percpu_add_case_64_lse) has huge variation,
> > while the LL/SC case is stable.
> > In some cases, the LSE function runs at the same latency as the LL/SC
> > function and is slightly faster at p50, but then something happens to the
> > system and LSE operations start to take far longer than LL/SC.
> > 
> > Here is some interesting output showing the latency of the functions above:
> > 
> > 	CPU: 47 - Latency Percentiles:
> > 	====================
> > 	LL/SC:   p50: 5.69 ns      p95: 5.71 ns      p99: 5.80 ns
> > 	LSE  :   p50: 45.53 ns     p95: 54.06 ns     p99: 55.18 ns
> (...)

Thank you very much for the detailed testing on a variety of hardware
platforms!!!

> Very interesting. I've run them here on an 80-core Ampere Altra made
> of Neoverse-N1 (armv8.2) and am getting very consistently better timings
> with LSE than LL/SC:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
>   LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
>   LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
>   LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.02 ns
>   (...)
> 
> They're *all* like this, between 7.32 and 7.36 for LL/SC p99,
> and 5.01 to 5.03 for LSE p99.
> 
> However, on a CIX-P1 (armv9.2, 8xA720 + 4xA520), it's what you've
> observed, i.e. a lot of variations that do not even depend on big
> vs little cores:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.56 ns     p95: 7.13 ns    p99: 8.81 ns
>   LSE  :   p50: 45.79 ns    p95: 45.80 ns   p99: 45.86 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
>   LSE  :   p50: 67.72 ns    p95: 67.78 ns   p99: 67.80 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
>   LSE  :   p50: 59.19 ns    p95: 59.23 ns   p99: 59.25 ns
>   (...)
> 
> I tried the same on an RK3588, which has 4 Cortex A55 and 4 Cortex A76
> (the latter being very close to Neoverse-N1), and the A76 (the last 4
> cores) show the same pattern as the Altra above, with LSE consistently
> much better than LL/SC:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.41 ns
>   LSE  :   p50: 4.43 ns     p95: 28.60 ns   p99: 30.29 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.59 ns
>   LSE  :   p50: 4.42 ns     p95: 27.51 ns   p99: 29.46 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.40 ns     p95: 9.40 ns    p99: 9.40 ns
>   LSE  :   p50: 4.42 ns     p95: 27.00 ns   p99: 29.60 ns
>   
>    CPU: 3 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 10.43 ns
>   LSE  :   p50: 8.02 ns     p95: 29.72 ns   p99: 31.05 ns
>   
>    CPU: 4 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.85 ns     p95: 8.86 ns    p99: 8.86 ns
>   LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 5.75 ns
>   
>    CPU: 5 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.85 ns     p95: 8.85 ns    p99: 9.28 ns
>   LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 8.29 ns
>   
>    CPU: 6 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.79 ns     p95: 8.80 ns    p99: 8.80 ns
>   LSE  :   p50: 5.71 ns     p95: 5.71 ns    p99: 5.71 ns
>   
>    CPU: 7 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.80 ns     p95: 8.80 ns    p99: 9.30 ns
>   LSE  :   p50: 5.71 ns     p95: 5.72 ns    p99: 5.72 ns
> 
> Finally, on a Qualcomm QC6490 with 4xA55 + 4xA78, I'm getting something
> between the two (and the governor is in performance mode):
> 
>  ./percpu_bench 
> ARM64 Per-CPU Atomic Add Benchmark
> ===================================
> Running percentile measurements (100 iterations)...
> Detected 8 CPUs
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.28 ns
>   LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 19.48 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.26 ns
>   LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 16.30 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.25 ns
>   LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 4.65 ns
>   
>    CPU: 3 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.36 ns
>   LSE  :   p50: 4.63 ns     p95: 19.01 ns   p99: 32.15 ns
>   
>    CPU: 4 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
>   LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
>   
>    CPU: 5 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
>   LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
>   
>    CPU: 6 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.28 ns
>   LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.45 ns
>   
>    CPU: 7 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.58 ns
>   LSE  :   p50: 4.82 ns     p95: 4.82 ns    p99: 4.83 ns
> 
> So it seems at first glance that LL/SC is generally slower but can be
> more consistent on modern machines, that LSE is stable on older machines
> and can be stable sometimes even on some modern machines.

I guess that I am glad that I am not alone?  ;-)

I am guessing that there is no reasonable way to check whether a
given system has slow LSE, as would be needed to use ALTERNATIVE(),
but please let me know if I am mistaken.
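
To illustrate the distinction, here is a minimal user-space sketch (not
kernel code, and not taken from the benchmark). It shows what can be
checked cheaply: whether the CPU advertises LSE atomics at all, via the
arm64 HWCAP_ATOMICS bit. The helper name cpu_advertises_lse() is made up
for this example. Nothing in this bit says whether the LSE instructions
are fast or slow on a given part, which is the information that would
actually be needed here.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)	/* arm64 hwcap bit for LSE atomics */
#endif
#endif

/*
 * Hypothetical helper: returns 1 if the CPU advertises LSE atomics,
 * 0 if it does not, and -1 when not built for arm64 Linux.
 * This reports presence only, never latency.
 */
static int cpu_advertises_lse(void)
{
#if defined(__linux__) && defined(__aarch64__)
	return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) ? 1 : 0;
#else
	return -1;
#endif
}
```

On the machines quoted below, this would presumably return 1 for both the
stable and the unstable ones, so a feature bit alone cannot tell them apart.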

							Thanx, Paul

> @Catalin, I *tried* to do the ldadd test but I wasn't sure what to put in
> the Xt register (to be honest I've never understood Arm's docs regarding
> instructions, even the pseudo language is super cryptic to me), and I came
> up with this:
> 
>         asm volatile(
>                 /* LSE atomics */
>                 "    ldadd    %[val], %[out], %[ptr]\n"
>                 : [ptr] "+Q"(*(u64 *)ptr), [out] "+r" (val)
>                 : [val] "r"((u64)(val))
>                 : "memory");
> 
> which assembles like this:
> 
>  ab8:   f8200040        ldadd   x0, x0, [x2]
> 
> It now gives me much better LSE performance on the ARMv9:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.56 ns     p95: 7.32 ns    p99: 8.72 ns
>   LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.77 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
>   LSE  :   p50: 5.09 ns     p95: 5.11 ns    p99: 5.11 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.58 ns    p99: 9.07 ns
>   LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
>   
>    CPU: 3 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 7.42 ns
>   LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
>   
>    CPU: 4 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
>   LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.47 ns
>   
>    CPU: 5 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 6 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.42 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 7 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 8 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 9 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.05 ns     p95: 7.06 ns    p99: 7.07 ns
>   LSE  :   p50: 2.96 ns     p95: 2.97 ns    p99: 2.97 ns
>   
>    CPU: 10 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.05 ns     p95: 7.05 ns    p99: 7.06 ns
>   LSE  :   p50: 2.96 ns     p95: 2.96 ns    p99: 2.97 ns
>   
>    CPU: 11 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.56 ns     p95: 6.56 ns    p99: 6.57 ns
>   LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.76 ns
> 
> (cores 0,5-11 are A720, cores 1-4 are A520). I'd just like a
> confirmation that my change is correct and that I'm not just doing
> something ignored that tries to add zero :-/
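
As far as I understand the instruction, LDADD Xs, Xt, [Xn] adds Xs to the
memory at [Xn] and returns the previous memory contents in Xt. In the asm
above both [val] and [out] are bound to val, which matches the
"ldadd x0, x0, [x2]" disassembly, so it adds val rather than zero and then
overwrites val with the old contents. One way to sanity-check that
behaviour without reading the Arm pseudocode is a small portable sketch
using the GCC/Clang __atomic_fetch_add() builtin, which has the same
fetch-and-add semantics. The helper names below are invented for this
example and are not part of Breno's test code; whether the compiler emits
LDADD for it depends on the target flags.

```c
#include <stdint.h>

/*
 * Hypothetical helper: atomically add val to *ptr and return the value
 * that *ptr held before the add (fetch-and-add, relaxed ordering).
 */
static uint64_t fetch_add_relaxed(uint64_t *ptr, uint64_t val)
{
	return __atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
}

/*
 * Returns 1 if the add took effect (memory grew by val) and the
 * returned value is the old contents, 0 otherwise.
 */
static int check_fetch_add(uint64_t start, uint64_t val)
{
	uint64_t counter = start;
	uint64_t old = fetch_add_relaxed(&counter, val);

	return old == start && counter == start + val;
}
```

Running the same kind of check against the inline-asm version, comparing
the counter before and after, would show whether the add really happens.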
> 
> If that's OK, then it's indeed way better!
> 
> Willy
> 
> PS: thanks Breno for sharing your test code, that's super useful!