[PATCH] arm64: remove HAVE_CMPXCHG_LOCAL
Dev Jain
dev.jain at arm.com
Mon Feb 16 07:29:17 PST 2026
On 16/02/26 4:30 pm, Will Deacon wrote:
> On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
>> It turns out the generic disable/enable irq this_cpu_cmpxchg
>> implementation is faster than LL/SC or lse implementation. Remove
>> HAVE_CMPXCHG_LOCAL for better performance on arm64.
>>
>> Tested on Quad 1.9GHz CA55 platform:
>> average mod_node_page_state() cost decreases from 167ns to 103ns
>> the spawn (30 duration) benchmark in unixbench is improved
>> from 147494 lps to 150561 lps, improved by 2.1%
>>
>> Tested on Quad 2.1GHz CA73 platform:
>> average mod_node_page_state() cost decreases from 113ns to 85ns
>> the spawn (30 duration) benchmark in unixbench is improved
>> from 209844 lps to 212581 lps, improved by 1.3%
>>
>> Signed-off-by: Jisheng Zhang <jszhang at kernel.org>
>> ---
>> arch/arm64/Kconfig | 1 -
>> arch/arm64/include/asm/percpu.h | 24 ------------------------
>> 2 files changed, 25 deletions(-)
> That is _entirely_ dependent on the system, so this isn't the right
> approach. I also don't think it's something we particularly want to
> micro-optimise to accommodate systems that suck at atomics.
Hi Will,
As I mentioned in the other email, the suspect is not the atomics but
preempt_disable(). On Apple M3, the regression reported in [1] is
resolved by removing the preempt_disable/enable pair in
_pcp_protect_return. To prove this another way, I disabled
CONFIG_ARM64_HAS_LSE_ATOMICS and the regression worsened, indicating
that at least on Apple M3 the atomics themselves are fast; it is the
preemption bracketing that costs.
It would help to confirm this hypothesis on other hardware; perhaps
Jisheng can test this change on his platforms and confirm whether he
sees the same performance improvement.
By coincidence, Yang Shi has been discussing the this_cpu_* overhead
at [2].
[1] https://lore.kernel.org/all/1052a452-9ba3-4da7-be47-7d27d27b3d1d@arm.com/
[2] https://lore.kernel.org/all/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/
>
> Will
>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 38dba5f7e4d2..5e7e2e65d5a5 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -205,7 +205,6 @@ config ARM64
>> select HAVE_EBPF_JIT
>> select HAVE_C_RECORDMCOUNT
>> select HAVE_CMPXCHG_DOUBLE
>> - select HAVE_CMPXCHG_LOCAL
>> select HAVE_CONTEXT_TRACKING_USER
>> select HAVE_DEBUG_KMEMLEAK
>> select HAVE_DMA_CONTIGUOUS
>> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
>> index b57b2bb00967..70ffe566cb4b 100644
>> --- a/arch/arm64/include/asm/percpu.h
>> +++ b/arch/arm64/include/asm/percpu.h
>> @@ -232,30 +232,6 @@ PERCPU_RET_OP(add, add, ldadd)
>> #define this_cpu_xchg_8(pcp, val) \
>> _pcp_protect_return(xchg_relaxed, pcp, val)
>>
>> -#define this_cpu_cmpxchg_1(pcp, o, n) \
>> - _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> -#define this_cpu_cmpxchg_2(pcp, o, n) \
>> - _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> -#define this_cpu_cmpxchg_4(pcp, o, n) \
>> - _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> -#define this_cpu_cmpxchg_8(pcp, o, n) \
>> - _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> -
>> -#define this_cpu_cmpxchg64(pcp, o, n) this_cpu_cmpxchg_8(pcp, o, n)
>> -
>> -#define this_cpu_cmpxchg128(pcp, o, n) \
>> -({ \
>> - typedef typeof(pcp) pcp_op_T__; \
>> - u128 old__, new__, ret__; \
>> - pcp_op_T__ *ptr__; \
>> - old__ = o; \
>> - new__ = n; \
>> - preempt_disable_notrace(); \
>> - ptr__ = raw_cpu_ptr(&(pcp)); \
>> - ret__ = cmpxchg128_local((void *)ptr__, old__, new__); \
>> - preempt_enable_notrace(); \
>> - ret__; \
>> -})
>>
>> #ifdef __KVM_NVHE_HYPERVISOR__
>> extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);
>> --
>> 2.51.0
>>