[PATCH] arm64: io: permit offset addressing
Mark Rutland
mark.rutland at arm.com
Thu Jan 25 03:22:07 PST 2024
Hi,
[adding LLVM folk for codegen concern below]
On Wed, Jan 24, 2024 at 01:24:18PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 24, 2024 at 11:12:59AM +0000, Mark Rutland wrote:
> > Currently our IO accessors all use register addressing without offsets,
> > but we could safely use offset addressing (without writeback) to
> > simplify and optimize the generated code.
>
> > Aside from the better code generation, there should be no functional
> > change as a result of this patch. I have lightly tested this patch,
> > including booting under KVM (where some devices such as PL011 are
> > emulated).
>
> Reviewed-by: Jason Gunthorpe <jgg at nvidia.com>
>
> FWIW I have had 0-day compile test a very similar patch with no
> compilation problems.
>
> I also got similar code savings results.
Thanks; much appreciated!
> However, I noticed that clang 17.0.6 at least does not have any
> benefit, I suppose it is a missing compiler feature.
Hmm; yes, clang doesn't seem to handle this well.
LLVM folk, any idea why clang can't make use of offset addressing for the
inline assembly in the original patch?
https://lore.kernel.org/linux-arm-kernel/20240124111259.874975-1-mark.rutland@arm.com/
I've written a self-contained test case below to demonstrate; I can't convince
clang to generate offset addressing modes whereas gcc will, meaning gcc
generates much nicer code. I'm not sure if this is just something clang can't
currently do or if we're tickling some edge case.
I also note clang isn't using XZR directly for the value, despite the "rZ"
constraint; I'm not sure if I'm doing something wrong here.
| [mark at lakrids:~/src/linux]% cat asm-test.c
| #define __always_inline inline __attribute__((always_inline))
| #define u64 unsigned long
|
| static __always_inline void writeq(u64 val, volatile u64 *addr)
| {
| asm volatile(
| " str %x[val], %[addr]\n"
| :
| : [addr] "Qo" (*addr),
| [val] "rZ" (val)
| );
| }
|
| void writeq_zero_8_times(volatile void *addr)
| {
| writeq(0, addr + 8 * 0);
| writeq(0, addr + 8 * 1);
| writeq(0, addr + 8 * 2);
| writeq(0, addr + 8 * 3);
| writeq(0, addr + 8 * 4);
| writeq(0, addr + 8 * 5);
| writeq(0, addr + 8 * 6);
| writeq(0, addr + 8 * 7);
| }
| [mark at lakrids:~/src/linux]% /mnt/data/opt/toolchain/kernel-org-llvm/17.0.6/llvm-17.0.6-x86_64/bin/clang --target=aarch64-linux -c asm-test.c -O2
| [mark at lakrids:~/src/linux]% usekorg 13.2.0 aarch64-linux-objdump -d asm-test.o
|
| asm-test.o: file format elf64-littleaarch64
|
|
| Disassembly of section .text:
|
| 0000000000000000 <writeq_zero_8_times>:
| 0: 91002009 add x9, x0, #0x8
| 4: aa1f03e8 mov x8, xzr
| 8: 9100400a add x10, x0, #0x10
| c: 9100600b add x11, x0, #0x18
| 10: 9100800c add x12, x0, #0x20
| 14: 9100a00d add x13, x0, #0x28
| 18: f9000008 str x8, [x0]
| 1c: f9000128 str x8, [x9]
| 20: 9100c009 add x9, x0, #0x30
| 24: 9100e00e add x14, x0, #0x38
| 28: f9000148 str x8, [x10]
| 2c: f9000168 str x8, [x11]
| 30: f9000188 str x8, [x12]
| 34: f90001a8 str x8, [x13]
| 38: f9000128 str x8, [x9]
| 3c: f90001c8 str x8, [x14]
| 40: d65f03c0 ret
| [mark at lakrids:~/src/linux]% usekorg 13.2.0 aarch64-linux-gcc -c asm-test.c -O2
| [mark at lakrids:~/src/linux]% usekorg 13.2.0 aarch64-linux-objdump -d asm-test.o
|
| asm-test.o: file format elf64-littleaarch64
|
|
| Disassembly of section .text:
|
| 0000000000000000 <writeq_zero_8_times>:
| 0: f900001f str xzr, [x0]
| 4: f900041f str xzr, [x0, #8]
| 8: f900081f str xzr, [x0, #16]
| c: f9000c1f str xzr, [x0, #24]
| 10: f900101f str xzr, [x0, #32]
| 14: f900141f str xzr, [x0, #40]
| 18: f900181f str xzr, [x0, #48]
| 1c: f9001c1f str xzr, [x0, #56]
| 20: d65f03c0 ret
In case this was an edge case I tried a few variations, which didn't seem to
help:
* Using different constraints (I tried input constraints {"Qo", "o", "m"} and
  output constraints {"=Qo", "=o", "=m"})
* Removing "volatile" in case that was getting in the way of some optimization
* Turning writeq() into a macro in case this was a problem with inlining
> Finally, in my experiments with the WC issue it wasn't entirely
> helpful, I could not get both gcc and clang to always generate
> load-free store sequences for 64 bytes even with this.
Fair enough, thanks for having a go.
FWIW, for the WC case I'd be happy to have inline asm for 8 back-to-back STRs,
if that would help? i.e. like before but with 8 STRs instead of 4 STPs.
Mark.