[PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops

Will Deacon will at kernel.org
Mon Dec 13 11:40:23 PST 2021


On Fri, Dec 10, 2021 at 03:14:09PM +0000, Mark Rutland wrote:
> We have overly conservative assembly constraints for the basic FEAT_LSE
> atomic instructions, and using more accurate and permissive constraints
> will allow for better code generation.
> 
> The FEAT_LSE basic atomic instructions come in two forms:
> 
> 	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
> 	ST{op}{order}{size} <Rs>, [<Rn>]
> 
> The ST* forms are aliases of the LD* forms where:
> 
> 	ST{op}{order}{size} <Rs>, [<Rn>]
> Is:
> 	LD{op}{order}{size} <Rs>, XZR, [<Rn>]
> 
> For either form, both <Rs> and <Rn> are read but not written back to,
> and <Rt> is written with the original value of the memory location.
> Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
> other register value(s) are consumed. There are no UNPREDICTABLE or
> CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
> <Rn> are the same register.
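> 
> For illustration (not from the patch itself), with <Rs> == <Rt> an
> LDADD behaves as follows:
> 
> 	ldadd	w1, w1, [x0]	// w1 is read as the addend, then
> 				// overwritten with the old memory value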
> 
> Our current inline assembly always uses <Rs> == <Rt>, treating this
> register as both an input and an output (using a '+r' constraint). This
> forces the compiler to do some unnecessary register shuffling and/or
> redundant value generation.
> 
> For example, the compiler cannot reuse the <Rs> value, and currently GCC
> 11.1.0 will compile:
> 
> 	__lse_atomic_add(1, a);
> 	__lse_atomic_add(1, b);
> 	__lse_atomic_add(1, c);
> 
> As:
> 
> 	mov     w3, #0x1
> 	mov     w4, w3
> 	stadd   w4, [x0]
> 	mov     w0, w3
> 	stadd   w0, [x1]
> 	stadd   w3, [x2]
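> 
> For reference, the current constraints look roughly like the
> following (a simplified sketch of the ATOMIC_OP case, eliding
> __LSE_PREAMBLE and the macro plumbing):
> 
> 	static inline void __lse_atomic_add(int i, atomic_t *v)
> 	{
> 		asm volatile(
> 		"	stadd	%w[i], %[v]\n"
> 		: [i] "+r" (i), [v] "+Q" (v->counter)
> 		: "r" (v));
> 	}
> 
> The '+r' tells the compiler that `i` is clobbered even though STADD
> never writes it back, which is what forces the shuffling above.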
> 
> We can improve this with more accurate constraints, separating <Rs> and
> <Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
> output-only value ('=r'). As <Rt> is written back after <Rs> is
> consumed, it does not need to be earlyclobber ('=&r'), leaving the
> compiler free to use the same register for both <Rs> and <Rt> where this
> is desirable.
> 
> At the same time, the redundant 'r' constraint for `v` is removed, as
> the `+Q` constraint is sufficient.
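> 
> Concretely (again as a simplified sketch), the FETCH case goes from:
> 
> 	asm volatile(
> 	"	ldadd	%w[i], %w[i], %[v]"
> 	: [i] "+r" (i), [v] "+Q" (v->counter)
> 	: "r" (v));
> 
> ... to:
> 
> 	int old;
> 
> 	asm volatile(
> 	"	ldadd	%w[i], %w[old], %[v]"
> 	: [v] "+Q" (v->counter), [old] "=r" (old)
> 	: [i] "r" (i));
> 
> ... and the non-value-returning case keeps only the '+Q' output and
> the 'r' input.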
> 
> With this change, the above example becomes:
> 
> 	mov     w3, #0x1
> 	stadd   w3, [x0]
> 	stadd   w3, [x1]
> 	stadd   w3, [x2]
> 
> I've made this change for the non-value-returning and FETCH ops. The
> RETURN ops have a multi-instruction sequence for which we cannot use the
> same constraints, and a subsequent patch will rewrite the RETURN ops in
> terms of the FETCH ops, relying on the ability for the compiler to reuse
> the <Rs> value.
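> 
> (Roughly, the RETURN ops will become wrappers of the form:
> 
> 	static inline int __lse_atomic_add_return(int i, atomic_t *v)
> 	{
> 		return __lse_atomic_fetch_add(i, v) + i;
> 	}
> 
> ... letting the compiler fold the final addition and reuse `i`.)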
> 
> This is intended as an optimization; there should be no functional
> change as a result of this patch.
> 
> Signed-off-by: Mark Rutland <mark.rutland at arm.com>
> Cc: Boqun Feng <boqun.feng at gmail.com>
> Cc: Catalin Marinas <catalin.marinas at arm.com>
> Cc: Peter Zijlstra <peterz at infradead.org>
> Cc: Will Deacon <will at kernel.org>
> ---
>  arch/arm64/include/asm/atomic_lse.h | 30 +++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)

Makes sense to me. I'm not sure _why_ the current constraints are so weird;
maybe a hangover from when we patched them inline? Anywho:

Acked-by: Will Deacon <will at kernel.org>

Will


