Overhead of arm64 LSE per-CPU atomics?

Tue Nov 4 10:08:19 PST 2025

Hello Breno,

On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> I found that the LSE case (__percpu_add_case_64_lse) has a huge variation,
> while LL/SC case is stable.
> In some case, LSE function runs at the same latency as LL/SC function and
> slightly faster on p50, but, something happen to the system and LSE operations
> start to take way longer than LL/SC.
> 
> Here are some interesting output coming from the latency of the functions above:
> 
> 	CPU: 47 - Latency Percentiles:
> 	====================
> 	LL/SC:   p50: 5.69 ns      p95: 5.71 ns      p99: 5.80 ns
> 	LSE  :   p50: 45.53 ns     p95: 54.06 ns     p99: 55.18 ns
(...)

Very interesting. I've run them here on an 80-core Ampere Altra made
of Neoverse-N1 cores (armv8.2) and I'm getting consistently better
timings with LSE than with LL/SC:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns

   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns

   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.02 ns
  (...)

They're *all* like this, between 7.32 and 7.36 for LL/SC p99,
and 5.01 to 5.03 for LSE p99.
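
In case it helps to read these numbers: I'm assuming the benchmark sorts
the per-iteration latencies and picks the percentiles from the sorted
array. A minimal nearest-rank sketch of that idea (the helper names are
mine, not taken from Breno's code) could look like this:

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: nearest-rank percentile of n > 0 latency samples.
 * Sorts the array in place; pct is expected to be in 1..100.
 */
static double percentile(double *samples, size_t n, unsigned int pct)
{
	size_t rank;

	qsort(samples, n, sizeof(*samples), cmp_double);

	/* rank = ceil(pct * n / 100), computed with integers only */
	rank = ((size_t)pct * n + 99) / 100;
	if (rank < 1)
		rank = 1;
	if (rank > n)
		rank = n;

	return samples[rank - 1];
}
```

With 100 samples and this definition, p99 is the 99th smallest value, so
it takes at least two slow iterations to move it.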

However, on a CIX-P1 (armv9.2, 8xA720 + 4xA520), it matches what you've
observed: LSE is much slower than LL/SC, with a lot of variation that
doesn't even depend on big vs little cores:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 7.13 ns    p99: 8.81 ns
  LSE  :   p50: 45.79 ns    p95: 45.80 ns   p99: 45.86 ns

   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
  LSE  :   p50: 67.72 ns    p95: 67.78 ns   p99: 67.80 ns

   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
  LSE  :   p50: 59.19 ns    p95: 59.23 ns   p99: 59.25 ns
  (...)

I tried the same on an RK3588, which has 4 Cortex-A55 and 4 Cortex-A76
(the latter being very close to Neoverse-N1). The A76 cores (the last 4,
i.e. CPUs 4-7) show the same pattern as the Altra above, with LSE
consistently much better than LL/SC:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.41 ns
  LSE  :   p50: 4.43 ns     p95: 28.60 ns   p99: 30.29 ns

   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.59 ns
  LSE  :   p50: 4.42 ns     p95: 27.51 ns   p99: 29.46 ns

   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.40 ns     p95: 9.40 ns    p99: 9.40 ns
  LSE  :   p50: 4.42 ns     p95: 27.00 ns   p99: 29.60 ns

   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 10.43 ns
  LSE  :   p50: 8.02 ns     p95: 29.72 ns   p99: 31.05 ns

   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.85 ns     p95: 8.86 ns    p99: 8.86 ns
  LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 5.75 ns

   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.85 ns     p95: 8.85 ns    p99: 9.28 ns
  LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 8.29 ns

   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.79 ns     p95: 8.80 ns    p99: 8.80 ns
  LSE  :   p50: 5.71 ns     p95: 5.71 ns    p99: 5.71 ns

   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.80 ns     p95: 8.80 ns    p99: 9.30 ns
  LSE  :   p50: 5.71 ns     p95: 5.72 ns    p99: 5.72 ns

Finally, on a Qualcomm QC6490 with 4xA55 + 4xA78, I'm getting something
between the two (and the governor is in performance mode):

 ./percpu_bench 
ARM64 Per-CPU Atomic Add Benchmark
===================================
Running percentile measurements (100 iterations)...
Detected 8 CPUs

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.28 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 19.48 ns

   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.26 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 16.30 ns

   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.25 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 4.65 ns

   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.36 ns
  LSE  :   p50: 4.63 ns     p95: 19.01 ns   p99: 32.15 ns

   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns

   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns

   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.28 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.45 ns

   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.58 ns
  LSE  :   p50: 4.82 ns     p95: 4.82 ns    p99: 4.83 ns

So at first glance it seems that LL/SC is generally slower but can be
more consistent on modern machines, while LSE is stable on older machines
and only sometimes stable on modern ones.

@Catalin, I *tried* to do the ldadd test but I wasn't sure what to put in
the Xt register (to be honest I've never understood Arm's docs regarding
instructions, even the pseudo language is super cryptic to me), and I came
up with this:

        asm volatile(
                /* LSE atomics */
                "    ldadd    %[val], %[out], %[ptr]\n"
                : [ptr] "+Q"(*(u64 *)ptr), [out] "+r" (val)
                : [val] "r"((u64)(val))
                : "memory");

which assembles like this:

 ab8:   f8200040        ldadd   x0, x0, [x2]

It now gives me much better LSE performance on the ARMv9 machine (the
CIX-P1 above):

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 7.32 ns    p99: 8.72 ns
  LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.77 ns

   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
  LSE  :   p50: 5.09 ns     p95: 5.11 ns    p99: 5.11 ns

   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.58 ns    p99: 9.07 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns

   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 7.42 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns

   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.47 ns

   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns

   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.42 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns

   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns

   CPU: 8 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns

   CPU: 9 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.05 ns     p95: 7.06 ns    p99: 7.07 ns
  LSE  :   p50: 2.96 ns     p95: 2.97 ns    p99: 2.97 ns

   CPU: 10 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.05 ns     p95: 7.05 ns    p99: 7.06 ns
  LSE  :   p50: 2.96 ns     p95: 2.96 ns    p99: 2.97 ns

   CPU: 11 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 6.56 ns    p99: 6.57 ns
  LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.76 ns

(cores 0,5-11 are A720, cores 1-4 are A520). I'd just like a
confirmation that my change is correct and that I'm not just issuing
something that gets ignored or that only adds zero :-/

If that's OK, then it's indeed way better!
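
For what it's worth, my current understanding of the instruction is that
in "ldadd Xs, Xt, [Xn]", Xs holds the value to add and Xt receives the
previous memory content, and that stadd is the alias with Xt set to xzr.
Under that assumption, the snippet above does add val, but it also
overwrites val with the old value since both operands landed in x0. A
minimal sketch with a separate output operand (fetch_add_u64 and old are
hypothetical names of mine, and the non-LSE fallback to the compiler
builtin is only there so that it builds everywhere):

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the benchmark's code: atomically add 'val'
 * to *ptr and return the value that was in memory before the addition.
 */
static inline uint64_t fetch_add_u64(uint64_t *ptr, uint64_t val)
{
	uint64_t old;

#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	__asm__ volatile(
		/* LSE atomics: Xs = addend (input), Xt = old value (output) */
		"    ldadd    %[val], %[old], %[ptr]\n"
		: [ptr] "+Q"(*ptr), [old] "=r"(old)
		: [val] "r"(val)
		: "memory");
#else
	/* Fallback when the compiler is not targeting LSE */
	old = __atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
#endif
	return old;
}
```

Keeping the addend and the returned value in separate operands means val
is left untouched, so the same value can be reused on the next iteration
of a measurement loop.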

Willy

PS: thanks Breno for sharing your test code, that's super useful!