Overhead of arm64 LSE per-CPU atomics?

Palmer Dabbelt palmer at dabbelt.com
Wed Nov 5 13:13:10 PST 2025


On Wed, 05 Nov 2025 11:16:42 PST (-0800), Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 09:40:32AM -0800, Paul E. McKenney wrote:
>> On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote:
>> > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote:
>> > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
>> > > > Given that this_cpu_*() are meant for the local CPU, there's less risk
>> > > > of cache line bouncing between CPUs, so I'm happy to change them to
>> > > > either use PRFM or LDADD (I think I prefer the latter). This would not
>> > > > be a generic change for the other atomics, only the per-CPU ones.
>> > >
>> > > I have easy access to only the one type of ARM system, and of course
>> > > the choice must be driven by a wide range of systems.  But yes, it
>> > > would be much better if we can just use this_cpu_inc().  I will use the
>> > > non-atomics protected by interrupt disabling in the meantime, but look
>> > > forward to being able to switch back.
>> >
>> > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU
>> > or just in a microbenchmark hammering them? From what I understand from
>> > the hardware folk, doing STADD in a loop saturates some queues in the
>> > interconnect and slows down eventually. In normal use, it's just a
>> > posted operation not affecting the subsequent instructions (or at least
>> > that's the theory).
>>
>> Only in a microbenchmark, and Breno did not find any issues in larger
>> benchmarks, so good to hear!

FWIW, I have a proxy workload where enabling ATOMIC_*_FORCE_NEAR is ~1% 
better (at application-level throughput).  It's supposed to be 
representative of real workloads and isn't supposed to have contention, 
but I don't trust these workloads at all so take that with a grain of 
salt...

Looking into this is still on my TODO list.  I was planning on doing 
it all internally as a tuning thing, but LMK if folks think it's 
interesting and I'll try to find some way to talk about it publicly.

>> Now, some non-arm64 systems deal with it just fine, but perhaps I owe
>> everyone an apology for the firedrill.
>
> That was a useful exercise, I learnt more things about the arm atomics.
>
>> But let me put it this way...  Would you ack an SRCU patch that resulted
>> in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on
>> other systems?
>
> Only if it's backed by other microbenchmarks showing significant
> improvements ;).
>
> I think we should change the percpu atomics, it makes more sense to do
> them near, but I'll keep the others as they are. Planning to post a

I guess I kind of went down a rabbit hole here, but I think I found some 
interesting stuff.  This is all based on some modifications of Breno's 
microbenchmark to add two things:

* A contending thread, which performs the same operation on the same 
  counter in a loop, with operations separated by a variable-counted 
  loop of NOPs.
* Some busy work for the timed thread, which is also just a loop of 
  NOPs.

Those loops look like

    for (d = 0; d < duty; d++)
        __asm__ volatile ("nop");

in the code and get compiled to

                            for (d = 0; d < duty; d++)
      41037c:       f90007ff        str     xzr, [sp, #8]
      410380:       14000001        b       410384 <run_core_benchmark+0x74>
      410384:       f94007e8        ldr     x8, [sp, #8]
      410388:       f85e03a9        ldur    x9, [x29, #-32]
      41038c:       eb090108        subs    x8, x8, x9
      410390:       54000102        b.cs    4103b0 <run_core_benchmark+0xa0>  // b.hs, b.nlast
      410394:       14000001        b       410398 <run_core_benchmark+0x88>
                                    __asm__ volatile ("nop");
      410398:       d503201f        nop
      41039c:       14000001        b       4103a0 <run_core_benchmark+0x90>
                            for (d = 0; d < duty; d++)
      4103a0:       f94007e8        ldr     x8, [sp, #8]
      4103a4:       91000508        add     x8, x8, #0x1
      4103a8:       f90007e8        str     x8, [sp, #8]
      4103ac:       17fffff6        b       410384 <run_core_benchmark+0x74>
                    }

which is I guess kind of wacky generated code, but is maybe a reasonable 
proxy for work -- it's got loads/stores/branches, which IIUC is what real 
code does ;)
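
In case the harness shape matters for interpreting the numbers below, 
the overall structure is roughly this (a simplified sketch, not the 
exact code -- that's in the PR linked below; the contender() name, the 
pthread plumbing, and the parameter values here are just for 
illustration, and the timing/percentile collection is omitted):

    #include <pthread.h>
    #include <stdatomic.h>

    static _Atomic unsigned long counter;
    static _Atomic int stop;

    /* Contending thread: same operation on the same counter, with a
     * configurable gap of NOPs between operations. */
    static void *contender(void *arg)
    {
        unsigned long gap = *(unsigned long *)arg, d;

        while (!stop) {
            /* Stand-in for the STADD/LDADD/LL-SC variants under test. */
            atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
            for (d = 0; d < gap; d++)
                __asm__ volatile ("nop");
        }
        return NULL;
    }

    /* Timed thread: the operation under test, followed by the "duty"
     * loop of NOPs as busy work. */
    static void run_core_benchmark(unsigned long duty, unsigned long iters)
    {
        unsigned long i, d;

        for (i = 0; i < iters; i++) {
            atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
            for (d = 0; d < duty; d++)
                __asm__ volatile ("nop");
        }
    }

    int main(void)
    {
        unsigned long gap = 100;
        pthread_t tid;

        pthread_create(&tid, NULL, contender, &gap);
        run_core_benchmark(200, 1000000);
        stop = 1;
        pthread_join(tid, NULL);
        return 0;
    }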

I ran a bunch of cases with those:

 CPU: 0 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 063.65 ns        p95: 065.02 ns          p99: 065.32 ns
LSE (stadd)     (c                0, d              100):   p50: 063.71 ns        p95: 064.96 ns          p99: 065.68 ns
LSE (stadd)     (c                0, d              200):   p50: 068.07 ns        p95: 082.98 ns          p99: 083.24 ns
LSE (stadd)     (c                0, d              300):   p50: 098.96 ns        p95: 121.14 ns          p99: 122.04 ns
LSE (stadd)     (c               10, d                0):   p50: 115.33 ns        p95: 117.25 ns          p99: 117.35 ns
LSE (stadd)     (c               10, d              300):   p50: 115.30 ns        p95: 119.12 ns          p99: 121.68 ns
LSE (stadd)     (c               10, d              500):   p50: 162.94 ns        p95: 185.24 ns          p99: 195.79 ns
LSE (stadd)     (c               30, d                0):   p50: 115.17 ns        p95: 117.14 ns          p99: 117.84 ns
LSE (stadd)     (c              100, d                0):   p50: 115.17 ns        p95: 117.13 ns          p99: 117.35 ns
LSE (stadd)     (c            10000, d                0):   p50: 064.81 ns        p95: 066.24 ns          p99: 067.08 ns
LL/SC           (c                0, d                0):   p50: 005.66 ns        p95: 006.45 ns          p99: 006.47 ns
LL/SC           (c                0, d               10):   p50: 006.19 ns        p95: 006.98 ns          p99: 007.01 ns
LL/SC           (c                0, d               20):   p50: 007.35 ns        p95: 008.88 ns          p99: 009.46 ns
LL/SC           (c               10, d                0):   p50: 164.16 ns        p95: 462.97 ns          p99: 580.92 ns
LL/SC           (c               10, d               10):   p50: 303.22 ns        p95: 575.03 ns          p99: 609.62 ns
LL/SC           (c               10, d               20):   p50: 032.24 ns        p95: 042.03 ns          p99: 048.71 ns
LL/SC           (c             1000, d                0):   p50: 017.37 ns        p95: 018.18 ns          p99: 018.19 ns
LL/SC           (c             1000, d               10):   p50: 019.54 ns        p95: 020.37 ns          p99: 021.79 ns
LL/SC           (c          1000000, d                0):   p50: 015.46 ns        p95: 017.00 ns          p99: 017.25 ns
LL/SC           (c          1000000, d               10):   p50: 017.57 ns        p95: 019.16 ns          p99: 019.47 ns
LDADD           (c                0, d                0):   p50: 004.33 ns        p95: 004.64 ns          p99: 005.13 ns
LDADD           (c                0, d              100):   p50: 032.15 ns        p95: 040.29 ns          p99: 040.69 ns
LDADD           (c                0, d              200):   p50: 067.97 ns        p95: 083.04 ns          p99: 083.30 ns
LDADD           (c                0, d              300):   p50: 098.93 ns        p95: 120.79 ns          p99: 122.52 ns
LDADD           (c                1, d              100):   p50: 049.19 ns        p95: 072.23 ns          p99: 072.38 ns
LDADD           (c                1, d              200):   p50: 143.15 ns        p95: 145.34 ns          p99: 145.90 ns
LDADD           (c                1, d              300):   p50: 153.91 ns        p95: 162.57 ns          p99: 163.84 ns
LDADD           (c               10, d                0):   p50: 012.46 ns        p95: 013.24 ns          p99: 014.33 ns
LDADD           (c               10, d              100):   p50: 049.34 ns        p95: 069.35 ns          p99: 070.71 ns
LDADD           (c               10, d              200):   p50: 141.66 ns        p95: 143.65 ns          p99: 144.31 ns
LDADD           (c               10, d              300):   p50: 152.82 ns        p95: 163.51 ns          p99: 164.03 ns
LDADD           (c              100, d                0):   p50: 012.37 ns        p95: 013.23 ns          p99: 014.52 ns
LDADD           (c              100, d               10):   p50: 014.32 ns        p95: 015.11 ns          p99: 015.15 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 003.97 ns        p95: 005.23 ns          p99: 005.49 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 126.02 ns        p95: 127.72 ns          p99: 128.72 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 021.97 ns        p95: 023.93 ns          p99: 024.97 ns
PFRM_KEEP+STADD (c          1000000, d              100):   p50: 076.28 ns        p95: 080.88 ns          p99: 081.50 ns
PFRM_KEEP+STADD (c          1000000, d              200):   p50: 089.62 ns        p95: 091.49 ns          p99: 091.89 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 003.97 ns        p95: 005.23 ns          p99: 005.47 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 126.75 ns        p95: 128.96 ns          p99: 129.48 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 021.83 ns        p95: 023.75 ns          p99: 023.96 ns
PFRM_STRM+STADD (c          1000000, d              100):   p50: 074.48 ns        p95: 079.56 ns          p99: 080.73 ns
PFRM_STRM+STADD (c          1000000, d              200):   p50: 089.76 ns        p95: 091.14 ns          p99: 092.46 ns

Which I'm interpreting to say the following:

* LL/SC is pretty good for the common cases, but gets really bad under 
  the pathological cases.  It still seems always slower than LDADD.
* STADD has latency that blocks other STADDs, but not other CPU-local 
  work.  I'd bet there's a bunch of interactions with caches and memory 
  ordering here, but those would all just make STADD look worse, so I'm 
  ignoring them.
* LDADD is better than STADD even under pathologically highly contended 
  cases.  I was actually kind of surprised by this one; I thought the 
  far atomics would be better there.
* The prefetches help STADD, but they don't seem to make it better than 
  LDADD in any case.
* The LDADD latency also happens concurrently with other CPU operations, 
  like the STADD latency does.  It has less latency to hide, so the 
  latency starts to go up with less extra work, but it's never worse 
  than STADD.

So I think at least on this system, LDADD is just always better.

[My code's up in a PR to Breno's repo: 
https://github.com/leitao/debug/pull/2]
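
Roughly speaking, the flavours being compared look like this (a sketch, 
not copied verbatim from the PR; the PFRM_* rows are the STADD case 
with a pstl1keep/pstl1strm-style prefetch in front):

    /* Needs -march=armv8.1-a (or similar) for the LSE instructions. */

    /* LL/SC: exclusive load/store loop, retried on failure. */
    static inline void inc_llsc(unsigned long *p)
    {
        unsigned long tmp;
        int fail;

        asm volatile("1: ldxr  %x0, %2\n"
                     "   add   %x0, %x0, #1\n"
                     "   stxr  %w1, %x0, %2\n"
                     "   cbnz  %w1, 1b"
                     : "=&r" (tmp), "=&r" (fail), "+Q" (*p));
    }

    /* LSE STADD: store-only ("far") atomic, the old value is discarded. */
    static inline void inc_stadd(unsigned long *p)
    {
        asm volatile("stadd %x1, %0" : "+Q" (*p) : "r" (1UL));
    }

    /* LSE LDADD: same add, but the old value comes back in a register,
     * which seems to be what makes it behave like a "near" atomic here. */
    static inline void inc_ldadd(unsigned long *p)
    {
        unsigned long old;

        asm volatile("ldadd %x2, %x0, %1"
                     : "=&r" (old), "+Q" (*p)
                     : "r" (1UL));
    }

    /* Prefetch hint followed by STADD (KEEP variant shown; STRM would
     * presumably use pstl1strm instead). */
    static inline void inc_prfm_stadd(unsigned long *p)
    {
        asm volatile("prfm  pstl1keep, %0\n"
                     "   stadd %x1, %0"
                     : "+Q" (*p)
                     : "r" (1UL));
    }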

> proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings
> all day). Something like below but with more comments and a commit log:
>
> ------------------------8<--------------------------
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index 9abcc8ef3087..d4dff4b0cf50 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>  	"	stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n"		\
>  	"	cbnz	%w[loop], 1b",					\
>  	/* LSE atomics */						\
> -		#op_lse "\t%" #w "[val], %[ptr]\n"			\
> +		#op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n"	\
>  		__nops(3))						\
>  	: [loop] "=&r" (loop), [tmp] "=&r" (tmp),			\
>  	  [ptr] "+Q"(*(u##sz *)ptr)					\
> @@ -124,9 +124,9 @@ PERCPU_RW_OPS(8)
>  PERCPU_RW_OPS(16)
>  PERCPU_RW_OPS(32)
>  PERCPU_RW_OPS(64)
> -PERCPU_OP(add, add, stadd)
> -PERCPU_OP(andnot, bic, stclr)
> -PERCPU_OP(or, orr, stset)
> +PERCPU_OP(add, add, ldadd)
> +PERCPU_OP(andnot, bic, ldclr)
> +PERCPU_OP(or, orr, ldset)
>  PERCPU_RET_OP(add, add, ldadd)
>
>  #undef PERCPU_RW_OPS


