Overhead of arm64 LSE per-CPU atomics?

Sat Nov 1 11:07:26 PDT 2025

On Sat, Nov 01, 2025 at 10:44:48AM +0100, Willy Tarreau wrote:
> Hi!
> 
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > > -----------------8<------------------------
> > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > index 9abcc8ef3087..e381034324e1 100644
> > > > --- a/arch/arm64/include/asm/percpu.h
> > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > > >  	unsigned int loop;						\
> > > >  	u##sz tmp;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > > >  	unsigned int loop;						\
> > > >  	u##sz ret;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > > -----------------8<------------------------
> > > 
> > > I will give this a shot, thank you!
> > 
> > Jackpot!!!
> > 
> > This reduces the overhead to 8.427, which is significantly better than
> > the non-LSE value of 9.853.  Still room for improvement, but much
> > better than the 100ns values.
> 
> This is super interesting! I've blindly applied a similar change to all
> of our atomics in haproxy and am seeing a consistent 2-7% perf increase
> depending on the test on an 80-core Ampere Altra (neoverse-n1). There,
> too, we rely heavily on atomics to read/update mostly local variables,
> as we avoid sharing as much as possible. I'm pretty sure it does hurt
> in certain cases, and we don't have the per_cpu distinction that exists
> here. However, that makes me think about adding a "mostly local" variant
> that we can choose depending on the context. I'll
> continue to experiment, thanks for sharing this trick (particularly to
> Yicong Yang, the original reporter).

Agreed!
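
For anyone wanting to try the "mostly local" idea in userspace, here is a
rough sketch of what such a variant might look like, using the GCC/Clang
builtins rather than inline assembly. The helper names are hypothetical
(they are neither haproxy nor kernel APIs), and the assumption that a
write prefetch with low temporal locality maps to something like
"prfm pstl1strm" on arm64 depends on the compiler, so please check the
generated code before trusting any numbers.

```c
#include <stdint.h>

/*
 * Hypothetical "mostly local" variant: hint that *ptr is about to be
 * written, then do the atomic add. Intended for counters that are
 * almost always touched by the same CPU/thread.
 */
static inline uint64_t fetch_add_mostly_local(uint64_t *ptr, uint64_t val)
{
	/* rw = 1: prefetch for write; locality = 0: low temporal locality. */
	__builtin_prefetch(ptr, 1, 0);
	return __atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
}

/*
 * Plain variant with no prefetch hint, for variables that really are
 * shared, where pulling the line in early might hurt.
 */
static inline uint64_t fetch_add_shared(uint64_t *ptr, uint64_t val)
{
	return __atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
}
```

Both helpers return the old value, so a caller could pick one or the
other per call site and compare the results under its own workload.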

And before I forget (again!):

Tested-by: Paul E. McKenney <paulmck at kernel.org>

							Thanx, Paul