Overhead of arm64 LSE per-CPU atomics?
Breno Leitao
leitao at debian.org
Tue Nov 4 10:22:46 PST 2025
On Tue, Nov 04, 2025 at 07:08:19PM +0100, Willy Tarreau wrote:
> Hello Breno,
>
> On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> > I found that the LSE case (__percpu_add_case_64_lse) has a huge variation,
> > while the LL/SC case is stable.
> > In some cases, the LSE function runs at the same latency as the LL/SC function,
> > and slightly faster at p50, but then something happens to the system and LSE
> > operations start to take far longer than LL/SC.
> >
> > Here is some interesting output showing the latency of the functions above:
> >
> > CPU: 47 - Latency Percentiles:
> > ====================
> > LL/SC: p50: 5.69 ns p95: 5.71 ns p99: 5.80 ns
> > LSE : p50: 45.53 ns p95: 54.06 ns p99: 55.18 ns
> (...)
>
> Very interesting. I've run them here on an 80-core Ampere Altra made
> of Neoverse-N1 (armv8.2) and am getting very consistently better timings
> with LSE than LL/SC:
<snip>
> It now gives me much better LSE performance on the ARMv9:
I also see stable latency for ldadd in my test case, and it is also better than LL/SC.
CPU: 0 - Latency Percentiles:
====================
LL/SC: p50: 5.74 ns p95: 5.81 ns p99: 7.13 ns
LSE : p50: 4.34 ns p95: 4.36 ns p99: 4.40 ns
CPU: 1 - Latency Percentiles:
====================
LL/SC: p50: 5.74 ns p95: 5.77 ns p99: 5.82 ns
LSE : p50: 4.35 ns p95: 4.37 ns p99: 4.42 ns
CPU: 2 - Latency Percentiles:
====================
LL/SC: p50: 5.74 ns p95: 5.81 ns p99: 6.76 ns
LSE : p50: 4.35 ns p95: 4.80 ns p99: 5.55 ns
...
CPU: 71 - Latency Percentiles:
====================
LL/SC: p50: 5.72 ns p95: 5.75 ns p99: 5.91 ns
LSE : p50: 4.33 ns p95: 4.35 ns p99: 4.38 ns
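For anyone who wants to produce p50/p95/p99 figures like the ones above from
user space, here is a minimal sketch. It is not the test code shared earlier in
this thread: the helper names (cmp_u64, percentile_ns, now_ns,
measure_fetch_add) are hypothetical, it uses the nearest-rank percentile
definition, and it times a relaxed __atomic_fetch_add() rather than the kernel
per-CPU helpers. Whether the compiler emits an LSE instruction or an LL/SC loop
for that builtin depends on the target flags used for the build.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: qsort() comparator for uint64_t samples. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Nearest-rank percentile of an ascending-sorted array, p in 1..100. */
static uint64_t percentile_ns(const uint64_t *sorted, size_t n, unsigned int p)
{
	size_t rank;

	assert(n > 0 && p >= 1 && p <= 100);
	rank = (n * p + 99) / 100;	/* ceil(n * p / 100) */
	return sorted[rank - 1];
}

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Fill samples[0..n) with the total time, in ns, of 'batch' relaxed
 * fetch-adds each, then sort ascending so percentile_ns() can be used.
 * Divide a sample by 'batch' (as a double) to get the per-operation cost.
 */
static void measure_fetch_add(uint64_t *samples, size_t n, unsigned int batch)
{
	static uint64_t counter;

	for (size_t i = 0; i < n; i++) {
		uint64_t t0 = now_ns();

		for (unsigned int j = 0; j < batch; j++)
			__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
		samples[i] = now_ns() - t0;
	}
	qsort(samples, n, sizeof(*samples), cmp_u64);
}
```

Calling measure_fetch_add(samples, n, batch) on a pinned thread and then
percentile_ns(samples, n, 50), (…, 95) and (…, 99) gives the three columns.
Timing a batch instead of a single operation keeps the clock_gettime() overhead
from dominating a measurement that is only a few nanoseconds long.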
> PS: thanks Breno for sharing your test code, that's super useful!
Glad you liked it. I tried to narrow down the problem as much as I could, so
that I could follow up on the discussion. :-)