[RFT PATCH] arm64: atomics: prefetch the destination prior to LSE operations
Palmer Dabbelt
palmer at dabbelt.com
Thu Nov 6 14:23:49 PST 2025
On Sat, 09 Aug 2025 02:48:41 PDT (-0700), yangyicong at huawei.com wrote:
> On 2025/8/8 19:35, Will Deacon wrote:
>> On Thu, Jul 24, 2025 at 08:06:51PM +0800, Yicong Yang wrote:
>>> From: Yicong Yang <yangyicong at hisilicon.com>
>>>
>>> commit 0ea366f5e1b6 ("arm64: atomics: prefetch the destination word for write prior to stxr")
>>> adds a prefetch prior to LL/SC operations for performance reasons - bringing
>>> the cacheline into an exclusive state can be costly. The same is true for
>>> LSE operations, so prefetch the destination prior to LSE operations as well.
>>>
>>> Tested on my HIP08 server (2 * 64 CPUs) using `perf bench -r 100 futex all`,
>>> which stresses the spinlock of the futex hash bucket:
>>>                             6.16-rc7    patched
>>> futex/hash (ops/sec)          171843     204757   +19.15%
>>> futex/wake (ms)               0.4630     0.4216    +8.94%
>>> futex/wake-parallel (ms)      0.0048     0.0039   +18.75%
>>> futex/requeue (ms)            0.1487     0.1508    -1.41%
>>>   (2nd validation)                       0.1484    +0.2%
>>> futex/lock-pi (ops/sec)          125        126    +0.8%
>>>
>>> For a single wake test with different numbers of threads, using `perf bench
>>> -r 100 futex wake -t <threads>` (times in ms):
>>> threads    6.16-rc7    patched
>>>       1      0.0035     0.0032    +8.57%
>>>      48      0.1454     0.1221   +16.02%
>>>      96      0.3047     0.2304   +24.38%
>>>     160      0.5489     0.5012    +8.69%
>>>     192      0.6675     0.5906   +11.52%
>>>     256      0.9445     0.8092   +14.33%
>>>
>>> There's some variation between close numbers, but overall the results
>>> look positive.
>>>
>>> Signed-off-by: Yicong Yang <yangyicong at hisilicon.com>
>>> ---
>>>
>>> RFT for testing and feedback, since I'm not sure whether this is a general
>>> win or just an optimization on some specific implementations.
>>>
>>> arch/arm64/include/asm/atomic_lse.h | 7 +++++++
>>> arch/arm64/include/asm/cmpxchg.h | 3 ++-
>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> One of the motivations behind rmw instructions (as opposed to ldxr/stxr
>> loops) is so that the atomic operation can be performed at different
>> places in the memory hierarchy depending upon where the data resides.
>>
>> For example, if a shared counter is sitting at a level of system cache,
>> it may be optimal to leave it there so that CPUs around the system can
>> post atomic increments to it without forcing the line up and down the
>> cache hierarchy every time.
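For reference, the hunks from that posting aren't quoted above, but from the
commit message and diffstat the change is roughly of this shape: a write-intent
prefetch of the destination line ahead of the LSE instruction. This is only a
sketch of the idea (the helper name here is made up), not the actual patch:

static __always_inline void lse_atomic_add_prefetched(int i, atomic_t *v)
{
	asm volatile(
	__LSE_PREAMBLE
	"	prfm	pstl1strm, %[v]\n"	/* prefetch for store before the atomic */
	"	stadd	%w[i], %[v]\n"
	: [v] "+Q" (v->counter)
	: [i] "r" (i));
}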
A few of us were over here
https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/
talking about similar things. It doesn't actually have anything to do with
Paul's issue, as that's in percpu, but I recently happened to run into some
cases where ATOMIC_ST_NEAR produced better application-level throughput and
thought I'd poke around here too.
Over there we also found that microbenchmarks report better performance for a
bunch of different flavors of these atomics (LDADD, prefetches, and
ATOMIC_ST_NEAR on my end). This was true even for the contended cases, which I
found kind of surprising.
I benchmarked this patch with schbench and found about 10% worse p99
latency (and also slightly worse at the other tiers). I see the same
thing with ATOMIC_ST_NEAR (which IIUC basically just does this in HW).
I also converted everything to LDADD-style routines (i.e., not just the percpu
ones). Those were the best in the microbenchmarks, but in schbench they don't
show any difference compared to the STADD-style routines.
Here's the code in case anyone's interested, though:
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 87f568a94e55..03fddf5fa46f 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -14,17 +14,18 @@
static __always_inline void \
__lse_atomic_##op(int i, atomic_t *v) \
{ \
+ long tmp; \
asm volatile( \
__LSE_PREAMBLE \
- " " #asm_op " %w[i], %[v]\n" \
- : [v] "+Q" (v->counter) \
+ " " #asm_op " %w[i], %w[t], %[v]\n" \
+ : [v] "+Q" (v->counter), [t] "=r" (tmp) \
: [i] "r" (i)); \
}
-ATOMIC_OP(andnot, stclr)
-ATOMIC_OP(or, stset)
-ATOMIC_OP(xor, steor)
-ATOMIC_OP(add, stadd)
+ATOMIC_OP(andnot, ldclr)
+ATOMIC_OP(or, ldset)
+ATOMIC_OP(xor, ldeor)
+ATOMIC_OP(add, ldadd)
static __always_inline void __lse_atomic_sub(int i, atomic_t *v)
{
@@ -121,17 +122,18 @@ ATOMIC_FETCH_OP_AND( , al, "memory")
static __always_inline void \
__lse_atomic64_##op(s64 i, atomic64_t *v) \
{ \
+ long tmp; \
asm volatile( \
__LSE_PREAMBLE \
- " " #asm_op " %[i], %[v]\n" \
- : [v] "+Q" (v->counter) \
+ " " #asm_op " %[i], %[t], %[v]\n" \
+ : [v] "+Q" (v->counter), [t] "=r" (tmp) \
: [i] "r" (i)); \
}
-ATOMIC64_OP(andnot, stclr)
-ATOMIC64_OP(or, stset)
-ATOMIC64_OP(xor, steor)
-ATOMIC64_OP(add, stadd)
+ATOMIC64_OP(andnot, ldclr)
+ATOMIC64_OP(or, ldset)
+ATOMIC64_OP(xor, ldeor)
+ATOMIC64_OP(add, ldadd)
static __always_inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
{
> Yes, that's true. On a CHI-based system the atomic can be performed within
> the CPU (the RN-F), which is termed a near atomic, or outside the CPU in a
> system component (system cache, etc.), which is termed a far atomic [1].
> The example above refers to the far atomic case, where the atomic operation
> doesn't need to complete in the CPU cache.
>
> [1] https://developer.arm.com/documentation/102714/0100/Atomic-fundamentals
>
>>
>> So, although adding an L1 prefetch may help some specific benchmarks on
>> a specific system, I don't think this is generally a good idea for
>> scalability. The hardware should be able to figure out the best place to
>> do the operation and, if you have a system where that means it should
>> always be performed within the CPU, then you should probably configure
>> it not to send the atomic remotely rather than force that in the kernel
>> for everybody.
>>
>
> The prefetch may not benefit far atomics, since the atomic operation isn't
> done in the CPU cache, but it will help systems implemented with near
> atomics, since it loads the data into the CPU cache prior to the atomic
> operation. So alternatively, instead of enabling this all the time, would it
> be acceptable to make it a Kconfig/cmdline option, as an optimization for
> near-atomic systems, so that those users can benefit from it?
>
> Thanks.
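To make that near/far distinction a bit more concrete: STADD is architecturally
an alias of LDADD with WZR/XZR as the result register, so the CPU never needs
the old value back and the hardware is free to perform the update "far" (e.g.
at the system cache), while the LD-form hands the old value back to the CPU,
which at least on some implementations appears to steer the operation "near".
A minimal standalone sketch of the two forms (hypothetical helper names,
assumes an assembler that accepts .arch_extension lse):

static inline void add_st_form(int *p, int i)
{
	/* No result register: hardware may complete this far from the CPU. */
	asm volatile(".arch_extension lse\n"
		     "stadd %w[i], %[v]"
		     : [v] "+Q" (*p)
		     : [i] "r" (i));
}

static inline int add_ld_form(int *p, int i)
{
	int old;

	/* The old value comes back to the CPU, hinting at a near atomic. */
	asm volatile(".arch_extension lse\n"
		     "ldadd %w[i], %w[old], %[v]"
		     : [v] "+Q" (*p), [old] "=r" (old)
		     : [i] "r" (i));
	return old;
}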