Overhead of arm64 LSE per-CPU atomics?
Palmer Dabbelt
palmer at dabbelt.com
Mon Nov 3 12:12:34 PST 2025
On Sat, 01 Nov 2025 04:23:22 PDT (-0700), Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
>> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
>> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
>> > > I just realised that patch doesn't touch percpu.h at all. So what about
>> > > something like (untested):
>> > >
>> > > -----------------8<------------------------
>> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
>> > > index 9abcc8ef3087..e381034324e1 100644
>> > > --- a/arch/arm64/include/asm/percpu.h
>> > > +++ b/arch/arm64/include/asm/percpu.h
>> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
>> > > unsigned int loop; \
>> > > u##sz tmp; \
>> > > \
>> > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>> > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
>> > > /* LL/SC */ \
>> > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
>> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
>> > > unsigned int loop; \
>> > > u##sz ret; \
>> > > \
>> > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>> > > asm volatile (ARM64_LSE_ATOMIC_INSN( \
>> > > /* LL/SC */ \
>> > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
>> > > -----------------8<------------------------
>> >
>> > I will give this a shot, thank you!
>>
>> Jackpot!!!
>>
>> This reduces the overhead to 8.427, which is significantly better than
>> the non-LSE value of 9.853. Still room for improvement, but much
>> better than the 100ns values.
>>
>> I presume that you will send this up the normal path, but in the meantime,
>> I will pull this in for further local testing, and thank you!
>
> I think for this specific case it may work, for the futex as well but
> not generally. The Neoverse-V2 TRM lists some controls in the
> IMP_CPUECTLR_EL1, bits 29 to 33:
>
> https://developer.arm.com/documentation/102375/0002
>
> These can be configured depending on the system configuration but they
> are too big knobs to cover all use-cases within an OS. This register is
> typically configured by firmware, we don't touch it in Linux.

Mostly for Paul:

I have a patch to let you do this from Linux, and I have some
firmware for some of these internal systems that lets you set most of
these magic bits. I've noticed some unexpected behavior around prefetch
distance on an internal workload, but haven't gotten much farther there.
There are also some other bits that do wacky things...

Just FYI: Marc described trying to set these dynamically as trying to
swallow a running chainsaw, but LMK if you're feeling risky and I can
try and get you a copy of my setup. They seem to work fine for me ;)

> I'll dig some more but we may have to do tricks like prefetch if we
> can't find a hardware configuration that satisfies all cases.
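
For readers following along outside the kernel tree, the "prefetch for
store, then do the atomic" trick from the quoted diff can be sketched in
portable C. This is only an illustration under assumptions: the helper
name counter_add_prefetched() is hypothetical, it uses the GCC/Clang
builtins instead of the kernel's inline asm and per-CPU macros, and
whether the compiler emits "prfm pstl1strm" for this hint (or whether it
helps at all) depends on the toolchain and the CPU.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API: hint that *ptr is about to be
 * written, then perform a relaxed atomic add and return the new value.
 */
static inline uint64_t counter_add_prefetched(uint64_t *ptr, uint64_t val)
{
	/* rw = 1: prefetch for write; locality = 0: no temporal locality */
	__builtin_prefetch(ptr, 1, 0);

	/* With LSE enabled this is expected to become a single atomic add */
	return __atomic_add_fetch(ptr, val, __ATOMIC_RELAXED);
}
```

The prefetch is purely a hint, so the result of the add is the same with
or without it; only the timing can change.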