Overhead of arm64 LSE per-CPU atomics?

Wed Nov 5 08:21:32 PST 2025

On Wed, Nov 05, 2025 at 02:49:39PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> > On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > > But need to add the prefetch in per-cpu implementation as you've
> > > > noticed above (didn't add it since no prefetch for LL/SC
> > > > implementation there, maybe a missing?)
> > > 
> > > Maybe no-one stressed these to notice any difference between LL/SC and
> > > LSE.
> > 
> > Huh ? I can say for certain that LL/SC is a no-go beyond 16 cores, for
> > having faced catastrophic performance there on haproxy, while with LSE
> > it continues to scale almost linearly at least till 64.
> 
> I was referring only to the this_cpu_add() etc. functions (until Paul
> started using them). There definitely have been lots of benchmarks on
> the scalability of LL/SC. That's one of the reasons Arm added the LSE
> atomics years ago.
> 
> > But that does
> > not mean that if some possibilities are within reach to recover 90% of
> > the atomic overhead in uncontended case we shouldn't try to grab it at
> > a reasonable cost!
> 
> I agree. Even for these cases, I don't think the solution is LL/SC but
> rather better use of LSE (and better understanding of the hardware
> behaviour; feedback here should go both ways).
> 
> > I'm definitely adding in my todo list to experiment more on this on
> > various CPUs now ;-)
> 
> Thanks for the tests so far, very insightful. I think what's still
> good to assess is how PRFM+STADD compares to LDADD (without PRFM) in
> Breno's microbenchmarks. I suspect LDADD is still better.

I've hacked my microbenchmark to add the tests Catalin suggested, and it
seems PRFM reduces the latency variation of STADD.

This is what I am measuring now:

	/* LL/SC implementation */
	void __percpu_add_case_64_llsc(void *ptr, unsigned long val)
	{
	unsigned int loop;	/* stxr status, non-zero means retry */
	u64 tmp;		/* value loaded by ldxr */

	asm volatile(
		/* LL/SC */
		"1:  ldxr    %[tmp], %[ptr]\n"
		"    add     %[tmp], %[tmp], %[val]\n"
		"    stxr    %w[loop], %[tmp], %[ptr]\n"
		"    cbnz    %w[loop], 1b"
		: [loop] "=&r"(loop), [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using stadd */
	void __percpu_add_case_64_lse(void *ptr, unsigned long val)
	{
	asm volatile(
		/* LSE atomics */
		"    stadd    %[val], %[ptr]\n"
		: [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using ldadd */
	void __percpu_add_case_64_ldadd(void *ptr, unsigned long val)
	{
	u64 tmp;		/* old value returned by ldadd, unused */

	asm volatile(
		/* LSE atomics */
		"    ldadd    %[val], %[tmp], %[ptr]\n"
		: [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using PRFM + stadd */
	void __percpu_add_case_64_prfm_stadd(void *ptr, unsigned long val)
	{
	asm volatile(
		/* Prefetch + LSE atomics */
		"    prfm    pstl1keep, %[ptr]\n"
		"    stadd   %[val], %[ptr]\n"
		: [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using PRFM STRM + stadd */
	void __percpu_add_case_64_prfm_strm_stadd(void *ptr, unsigned long val)
	{
	asm volatile(
		/* Prefetch streaming + LSE atomics */
		"    prfm    pstl1strm, %[ptr]\n"
		"    stadd   %[val], %[ptr]\n"
		: [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}
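
For anyone who wants to reproduce the general shape of the measurement, here
is a minimal sketch of a timing loop for one variant. It is an assumption
about how such a harness can look, not the code in the repository:
time_batch_ns(), now_ns() and add_builtin() are hypothetical names, and
add_builtin() is only a portable stand-in (a compiler builtin) so the sketch
also builds on non-arm64 machines.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Same signature as the __percpu_add_case_64_*() variants above */
typedef void (*add_fn)(void *ptr, unsigned long val);

/* Hypothetical portable stand-in for one variant (compiler builtin) */
static void add_builtin(void *ptr, unsigned long val)
{
	__atomic_fetch_add((uint64_t *)ptr, (uint64_t)val, __ATOMIC_RELAXED);
}

/* Monotonic clock in nanoseconds */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost in ns of one fn() call over 'iters' back-to-back calls */
static double time_batch_ns(add_fn fn, void *ptr, unsigned long iters)
{
	uint64_t start, end;
	unsigned long i;

	assert(iters > 0);

	start = now_ns();
	for (i = 0; i < iters; i++)
		fn(ptr, 1);
	end = now_ns();

	return (double)(end - start) / (double)iters;
}
```

Averaging over a batch keeps the clock_gettime() overhead out of the per-op
number; pinning the thread to one CPU before calling it is left out here.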

PRFM definitely adds some stability to STADD (p50 drops from ~60 ns to
~7.9 ns on most CPUs) but, in most cases, it is still a bit behind the
regular ldxr/stxr (~5.7 ns).

	CPU: 0 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.73 ns      p95: 5.90 ns      p99: 7.35 ns
	STADD           :   p50: 65.99 ns     p95: 68.98 ns     p99: 70.13 ns
	LDADD           :   p50: 4.33 ns      p95: 4.34 ns      p99: 4.34 ns
	PRFM_KEEP+STADD :   p50: 7.89 ns      p95: 7.91 ns      p99: 8.82 ns
	PRFM_STRM+STADD :   p50: 7.89 ns      p95: 8.11 ns      p99: 9.76 ns

	CPU: 1 - Latency Percentiles:
	====================
	LL/SC           :   p50: 7.72 ns      p95: 18.00 ns      p99: 31.51 ns
	STADD           :   p50: 103.81 ns    p95: 127.60 ns     p99: 137.12 ns
	LDADD           :   p50: 4.35 ns      p95: 22.46 ns      p99: 25.03 ns
	PRFM_KEEP+STADD :   p50: 7.89 ns      p95: 22.04 ns      p99: 23.66 ns
	PRFM_STRM+STADD :   p50: 7.89 ns      p95: 8.75 ns       p99: 11.10 ns

	CPU: 2 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.73 ns      p95: 6.87 ns      p99: 23.96 ns
	STADD           :   p50: 63.30 ns     p95: 63.33 ns     p99: 63.36 ns
	LDADD           :   p50: 4.34 ns      p95: 4.35 ns      p99: 4.35 ns
	PRFM_KEEP+STADD :   p50: 7.89 ns      p95: 7.90 ns      p99: 7.91 ns
	PRFM_STRM+STADD :   p50: 7.89 ns      p95: 7.90 ns      p99: 7.90 ns

	CPU: 3 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.70 ns      p95: 5.71 ns      p99: 5.72 ns
	STADD           :   p50: 61.94 ns     p95: 62.95 ns     p99: 65.05 ns
	LDADD           :   p50: 4.32 ns      p95: 4.33 ns      p99: 7.28 ns
	PRFM_KEEP+STADD :   p50: 7.86 ns      p95: 7.87 ns      p99: 8.08 ns
	PRFM_STRM+STADD :   p50: 7.86 ns      p95: 7.87 ns      p99: 8.25 ns

	CPU: 4 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.72 ns      p95: 5.73 ns      p99: 5.74 ns
	STADD           :   p50: 62.04 ns     p95: 122.78 ns    p99: 131.43 ns
	LDADD           :   p50: 8.08 ns      p95: 11.70 ns     p99: 14.89 ns
	PRFM_KEEP+STADD :   p50: 13.83 ns     p95: 20.70 ns     p99: 22.54 ns
	PRFM_STRM+STADD :   p50: 12.80 ns     p95: 19.42 ns     p99: 20.36 ns

	CPU: 5 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.68 ns      p95: 5.70 ns      p99: 5.70 ns
	STADD           :   p50: 59.30 ns     p95: 60.52 ns     p99: 66.53 ns
	LDADD           :   p50: 4.30 ns      p95: 4.31 ns      p99: 4.32 ns
	PRFM_KEEP+STADD :   p50: 7.84 ns      p95: 7.85 ns      p99: 7.85 ns
	PRFM_STRM+STADD :   p50: 7.84 ns      p95: 7.85 ns      p99: 7.85 ns

	CPU: 6 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.70 ns      p95: 5.71 ns      p99: 5.72 ns
	STADD           :   p50: 59.37 ns     p95: 59.41 ns     p99: 59.42 ns
	LDADD           :   p50: 4.32 ns      p95: 4.32 ns      p99: 4.34 ns
	PRFM_KEEP+STADD :   p50: 7.85 ns      p95: 7.86 ns      p99: 7.88 ns
	PRFM_STRM+STADD :   p50: 7.85 ns      p95: 7.86 ns      p99: 7.86 ns

	CPU: 7 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.72 ns      p95: 5.74 ns      p99: 6.90 ns
	STADD           :   p50: 64.46 ns     p95: 74.34 ns     p99: 77.47 ns
	LDADD           :   p50: 4.35 ns      p95: 7.50 ns      p99: 10.06 ns
	PRFM_KEEP+STADD :   p50: 8.92 ns      p95: 14.34 ns     p99: 17.31 ns
	PRFM_STRM+STADD :   p50: 8.88 ns      p95: 13.74 ns     p99: 15.11 ns
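
For reading the p50/p95/p99 columns above: a minimal sketch of one common way
to get percentiles out of a set of latency samples (nearest-rank on a sorted
array). This is an assumption for illustration only, the benchmark in the
repository may compute them differently; sort_samples() and percentile() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Sort latency samples (in ns) in place, ascending */
static void sort_samples(double *samples, size_t n)
{
	qsort(samples, n, sizeof(*samples), cmp_double);
}

/*
 * Nearest-rank percentile: the smallest sample such that at least
 * pct% of all samples are less than or equal to it.
 * 'sorted' must already be in ascending order, pct is 1..100.
 */
static double percentile(const double *sorted, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct >= 1 && pct <= 100);

	/* ceil(pct * n / 100) using integer arithmetic only */
	rank = ((size_t)pct * n + 99) / 100;

	return sorted[rank - 1];
}
```

With this definition p99 is only meaningful with well over 100 samples,
otherwise it simply reports the maximum.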

As always, the code can be found at
https://github.com/leitao/debug/tree/main/LSE