[PATCH] arm64: Use load LSE atomics for the non-return per-CPU atomic operations
Palmer Dabbelt
palmer at dabbelt.com
Thu Nov 6 08:30:04 PST 2025
On Thu, 06 Nov 2025 07:52:13 PST (-0800), Catalin Marinas wrote:
> The non-return per-CPU this_cpu_*() atomic operations are implemented as
> STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture
> implementations, these instructions tend to be executed "far" in the
> interconnect or memory subsystem (unless the data is already in the L1
> cache). This is in general more efficient when there is contention as it
> avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD
> with a destination register other than XZR), OTOH, tend to be executed
> "near", with the data loaded into the L1 cache.
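
For anyone following along at home, a rough standalone sketch of the two
forms (my own toy functions, not the kernel macro itself; needs
-march=armv8.1-a or similar so the LSE instructions assemble):

	/*
	 * "Far" form: STADD has no destination register, so the update can
	 * be posted out towards the interconnect / memory subsystem.
	 */
	static inline void add_far(unsigned long *p, unsigned long v)
	{
		asm volatile("stadd %1, %0" : "+Q" (*p) : "r" (v));
	}

	/*
	 * "Near" form: LDADD with a destination register other than XZR
	 * returns the old value (discarded here), which tends to pull the
	 * line into the L1 cache so the op executes "near".
	 */
	static inline void add_near(unsigned long *p, unsigned long v)
	{
		unsigned long tmp;

		asm volatile("ldadd %2, %1, %0"
			     : "+Q" (*p), "=&r" (tmp)
			     : "r" (v));
	}
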
>
> STADD instructions executed back to back, as in srcu_read_{lock,unlock}*(),
> incur an additional overhead due to the default posting behaviour on
> several CPU implementations. Since the per-CPU atomics are unlikely to be
> used concurrently on the same memory location, encourage the hardware to
> execute them "near" by issuing load atomics - LDADD/LDCLR/LDSET - with
> the destination register unused (but not XZR).
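
FWIW, the pattern this is aimed at is roughly the following (hypothetical
per-CPU counter, loosely modelled on the SRCU read-side fast path):

	#include <linux/percpu.h>

	static DEFINE_PER_CPU(unsigned long, demo_ctr);	/* hypothetical */

	static void demo_read_side(void)
	{
		/* non-return per-CPU add: STADD today, LDADD with this patch */
		this_cpu_inc(demo_ctr);
		/* ... short, possibly empty, critical section ... */
		this_cpu_dec(demo_ctr);
	}

where the two non-return updates can end up back to back and run into the
posting behaviour described above.
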
>
> Signed-off-by: Catalin Marinas <catalin.marinas at arm.com>
> Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
> Reported-by: Paul E. McKenney <paulmck at kernel.org>
> Tested-by: Paul E. McKenney <paulmck at kernel.org>
> Cc: Will Deacon <will at kernel.org>
Reviewed-by: Palmer Dabbelt <palmer at dabbelt.com>
> ---
> arch/arm64/include/asm/percpu.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index 9abcc8ef3087..d4dff4b0cf50 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
> " stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n" \
> " cbnz %w[loop], 1b", \
> /* LSE atomics */ \
> - #op_lse "\t%" #w "[val], %[ptr]\n" \
> + #op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n" \
> __nops(3)) \
> : [loop] "=&r" (loop), [tmp] "=&r" (tmp), \
> [ptr] "+Q"(*(u##sz *)ptr) \
> @@ -124,9 +124,9 @@ PERCPU_RW_OPS(8)
> PERCPU_RW_OPS(16)
> PERCPU_RW_OPS(32)
> PERCPU_RW_OPS(64)
> -PERCPU_OP(add, add, stadd)
> -PERCPU_OP(andnot, bic, stclr)
> -PERCPU_OP(or, orr, stset)
> +PERCPU_OP(add, add, ldadd)
> +PERCPU_OP(andnot, bic, ldclr)
> +PERCPU_OP(or, orr, ldset)
> PERCPU_RET_OP(add, add, ldadd)
>
> #undef PERCPU_RW_OPS