Overhead of arm64 LSE per-CPU atomics?
Yicong Yang
yangyccccc at gmail.com
Sat Nov 1 04:41:02 PDT 2025
On 2025/11/1 19:23, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
>> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
>>> On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
>>>> I just realised that patch doesn't touch percpu.h at all. So what about
>>>> something like (untested):
>>>>
>>>> -----------------8<------------------------
>>>> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
>>>> index 9abcc8ef3087..e381034324e1 100644
>>>> --- a/arch/arm64/include/asm/percpu.h
>>>> +++ b/arch/arm64/include/asm/percpu.h
>>>> @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
>>>> unsigned int loop; \
>>>> u##sz tmp; \
>>>> \
>>>> + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); \
>>>> asm volatile (ARM64_LSE_ATOMIC_INSN( \
>>>> /* LL/SC */ \
>>>> "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \
>>>> @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \
>>>> unsigned int loop; \
>>>> u##sz ret; \
>>>> \
>>>> + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); \
>>>> asm volatile (ARM64_LSE_ATOMIC_INSN( \
>>>> /* LL/SC */ \
>>>> "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \
>>>> -----------------8<------------------------
>>> I will give this a shot, thank you!
>> Jackpot!!!
>>
>> This reduces the overhead to 8.427, which is significantly better than
>> the non-LSE value of 9.853. Still room for improvement, but much
>> better than the 100ns values.
>>
>> I presume that you will send this up the normal path, but in the meantime,
>> I will pull this in for further local testing, and thank you!
> I think for this specific case it may work, for the futex as well but
> not generally. The Neoverse-V2 TRM lists some controls in the
> IMP_CPUECTLR_EL1, bits 29 to 33:
>
> https://developer.arm.com/documentation/102375/0002
>
> These can be configured depending on the system configuration but they
> are too big knobs to cover all use-cases within an OS. This register is
> typically configured by firmware, we don't touch it in Linux.
>
> I'll dig some more but we may have to do tricks like prefetch if we
> can't find a hardware configuration that satisfies all cases.
>
FYI, there's a patch [1] that adds a boot option to enable a prefetch before LSE operations.
If we want to reconsider that approach, it's more flexible and can be controlled by the OS without
touching the system configuration (which may require a firmware update). But we would need to add the
prefetch to the per-cpu implementation as you've noticed above (I didn't add it there since the LL/SC
implementation has no prefetch there either; maybe that was an omission?)
[1] https://lore.kernel.org/all/20250919091747.3702-1-yangyicong@huawei.com/
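
To illustrate the idea only (this is not the code from [1]), here is a minimal user-space sketch of an OS-controlled prefetch in front of an atomic update. The flag prefetch_before_atomic and the helper counter_add_return() are hypothetical names standing in for the boot option and the per-cpu helper. It uses the compiler builtins __builtin_prefetch() and __atomic_add_fetch() rather than the kernel's inline asm, and I'm assuming the write-intent, low-locality hint is what the compiler lowers to a streaming store prefetch on arm64.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical switch standing in for a boot option. This is an
 * illustration only and not a real kernel symbol.
 */
static bool prefetch_before_atomic = true;

/*
 * Add val to *ptr and return the new value. When the switch is on,
 * first issue a prefetch with write intent (rw = 1) and low temporal
 * locality (locality = 0) so the line is requested for store before
 * the atomic executes.
 */
static inline uint64_t counter_add_return(uint64_t *ptr, uint64_t val)
{
	if (prefetch_before_atomic)
		__builtin_prefetch(ptr, 1, 0);

	/* Relaxed ordering, similar in spirit to a per-cpu counter update. */
	return __atomic_add_fetch(ptr, val, __ATOMIC_RELAXED);
}
```

The result is the same with the switch on or off; only the memory access pattern differs, which is the point of making it a runtime knob.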
thanks.