[RFT PATCH] arm64: atomics: prefetch the destination prior to LSE operations
Palmer Dabbelt
palmer at dabbelt.com
Thu Nov 6 14:23:49 PST 2025
On Sat, 09 Aug 2025 02:48:41 PDT (-0700), yangyicong at huawei.com wrote:
> On 2025/8/8 19:35, Will Deacon wrote:
>> On Thu, Jul 24, 2025 at 08:06:51PM +0800, Yicong Yang wrote:
>>> From: Yicong Yang <yangyicong at hisilicon.com>
>>>
>>> commit 0ea366f5e1b6 ("arm64: atomics: prefetch the destination word for write prior to stxr")
>>> adds a prefetch prior to LL/SC operations for performance reasons - bringing
>>> the cacheline into an exclusive state can be costly. The same is true for
>>> LSE operations, so prefetch the destination prior to LSE operations as well.
>>>
>>> Tested on my HIP08 server (2 * 64 CPUs) using `perf bench -r 100 futex all`,
>>> which stresses the spinlock of the futex hash bucket:
>>>                             6.16-rc7    patched
>>> futex/hash (ops/sec)          171843     204757   +19.15%
>>> futex/wake (ms)               0.4630     0.4216    +8.94%
>>> futex/wake-parallel (ms)      0.0048     0.0039   +18.75%
>>> futex/requeue (ms)            0.1487     0.1508    -1.41%
>>>   (2nd validation)                       0.1484    +0.2%
>>> futex/lock-pi (ops/sec)          125        126    +0.8%
>>>
>>> For a single wake test with different numbers of threads, using `perf bench
>>> -r 100 futex wake -t <threads>` (times in ms):
>>> threads    6.16-rc7    patched
>>>       1      0.0035     0.0032    +8.57%
>>>      48      0.1454     0.1221   +16.02%
>>>      96      0.3047     0.2304   +24.38%
>>>     160      0.5489     0.5012    +8.69%
>>>     192      0.6675     0.5906   +11.52%
>>>     256      0.9445     0.8092   +14.33%
>>>
>>> There's some variation between close numbers, but overall the results
>>> look positive.
>>>
>>> Signed-off-by: Yicong Yang <yangyicong at hisilicon.com>
>>> ---
>>>
>>> RFT for testing and feedback, since I'm not sure whether this is a general
>>> win or just an optimization on some specific implementations.
>>>
>>> arch/arm64/include/asm/atomic_lse.h | 7 +++++++
>>> arch/arm64/include/asm/cmpxchg.h | 3 ++-
>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> One of the motivations behind rmw instructions (as opposed to ldxr/stxr
>> loops) is so that the atomic operation can be performed at different
>> places in the memory hierarchy depending upon where the data resides.
>>
>> For example, if a shared counter is sitting at a level of system cache,
>> it may be optimal to leave it there so that CPUs around the system can
>> post atomic increments to it without forcing the line up and down the
>> cache hierarchy every time.
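For reference, the hunks from that posting aren't quoted above, but from the
commit message and diffstat the change is roughly of this shape: a write-intent
prefetch of the destination line ahead of the LSE instruction. This is only a
sketch of the idea (the helper name here is made up), not the actual patch:

static __always_inline void lse_atomic_add_prefetched(int i, atomic_t *v)
{
	asm volatile(
	__LSE_PREAMBLE
	"	prfm	pstl1strm, %[v]\n"	/* prefetch for store before the atomic */
	"	stadd	%w[i], %[v]\n"
	: [v] "+Q" (v->counter)
	: [i] "r" (i));
}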
A few of us were over here
https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/
talking about similar things. It doesn't actually have anything to do with
Paul's issue, as that's in percpu, but I recently happened to run into some
cases where ATOMIC_ST_NEAR produced better application-level throughput and
thought I'd poke around here too.
Over there we also found that microbenchmarks report better performance for a
bunch of different flavors of these atomics (LDADD, prefetches, and
ATOMIC_ST_NEAR on my end). This was true even for the contended cases, which I
found kind of surprising.
I benchmarked this patch with schbench and found about 10% worse p99
latency (and also slightly worse at the other tiers). I see the same
thing with ATOMIC_ST_NEAR (which IIUC basically just does this in HW).
I also converted everything to LDADD-style routines (i.e., not just the percpu
ones). Those were the best in the microbenchmarks, but in schbench they don't
show any difference compared to the STADD-style routines.
Here's the code in case anyone's interested, though:
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 87f568a94e55..03fddf5fa46f 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -14,17 +14,18 @@
static __always_inline void \
__lse_atomic_##op(int i, atomic_t *v) \
{ \
+ long tmp; \
asm volatile( \
__LSE_PREAMBLE \
- " " #asm_op " %w[i], %[v]\n" \
- : [v] "+Q" (v->counter) \
+ " " #asm_op " %w[i], %w[t], %[v]\n" \
+ : [v] "+Q" (v->counter), [t] "=r" (tmp) \
: [i] "r" (i)); \
}
-ATOMIC_OP(andnot, stclr)
-ATOMIC_OP(or, stset)
-ATOMIC_OP(xor, steor)
-ATOMIC_OP(add, stadd)
+ATOMIC_OP(andnot, ldclr)
+ATOMIC_OP(or, ldset)
+ATOMIC_OP(xor, ldeor)
+ATOMIC_OP(add, ldadd)
static __always_inline void __lse_atomic_sub(int i, atomic_t *v)
{
@@ -121,17 +122,18 @@ ATOMIC_FETCH_OP_AND( , al, "memory")
static __always_inline void \
__lse_atomic64_##op(s64 i, atomic64_t *v) \
{ \
+ long tmp; \
asm volatile( \
__LSE_PREAMBLE \
- " " #asm_op " %[i], %[v]\n" \
- : [v] "+Q" (v->counter) \
+ " " #asm_op " %[i], %[t], %[v]\n" \
+ : [v] "+Q" (v->counter), [t] "=r" (tmp) \
: [i] "r" (i)); \
}
-ATOMIC64_OP(andnot, stclr)
-ATOMIC64_OP(or, stset)
-ATOMIC64_OP(xor, steor)
-ATOMIC64_OP(add, stadd)
+ATOMIC64_OP(andnot, ldclr)
+ATOMIC64_OP(or, ldset)
+ATOMIC64_OP(xor, ldeor)
+ATOMIC64_OP(add, ldadd)
static __always_inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
{
> Yes, that's true. On a CHI-based system the atomic can be performed within
> the CPU (the RN-F), which is termed a near atomic, or outside the CPU in a
> system component (system cache, etc.), which is termed a far atomic [1].
> The example above refers to the far atomic case, where the atomic operation
> doesn't need to complete in the CPU cache.
>
> [1] https://developer.arm.com/documentation/102714/0100/Atomic-fundamentals
>
>>
>> So, although adding an L1 prefetch may help some specific benchmarks on
>> a specific system, I don't think this is generally a good idea for
>> scalability. The hardware should be able to figure out the best place to
>> do the operation and, if you have a system where that means it should
>> always be performed within the CPU, then you should probably configure
>> it not to send the atomic remotely rather than force that in the kernel
>> for everybody.
>>
>
> The prefetch may not benefit far atomics, since the atomic operation isn't
> done in the CPU cache, but it will help systems implemented with near
> atomics, since it loads the data into the CPU cache prior to the atomic
> operation. So alternatively, instead of enabling this all the time, would it
> be acceptable to make it a Kconfig/cmdline option, as an optimization for
> near-atomic systems, so that those users can benefit from it?
>
> Thanks.
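To make that near/far distinction a bit more concrete: STADD is architecturally
an alias of LDADD with WZR/XZR as the result register, so the CPU never needs
the old value back and the hardware is free to perform the update "far" (e.g.
at the system cache), while the LD-form hands the old value back to the CPU,
which at least on some implementations appears to steer the operation "near".
A minimal standalone sketch of the two forms (hypothetical helper names,
assumes an assembler that accepts .arch_extension lse):

static inline void add_st_form(int *p, int i)
{
	/* No result register: hardware may complete this far from the CPU. */
	asm volatile(".arch_extension lse\n"
		     "stadd %w[i], %[v]"
		     : [v] "+Q" (*p)
		     : [i] "r" (i));
}

static inline int add_ld_form(int *p, int i)
{
	int old;

	/* The old value comes back to the CPU, hinting at a near atomic. */
	asm volatile(".arch_extension lse\n"
		     "ldadd %w[i], %w[old], %[v]"
		     : [v] "+Q" (*p), [old] "=r" (old)
		     : [i] "r" (i));
	return old;
}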