[PATCH] arm64: io: permit offset addressing
Mark Rutland
mark.rutland at arm.com
Thu Jan 25 03:22:07 PST 2024
Hi,
[adding LLVM folk for codegen concern below]
On Wed, Jan 24, 2024 at 01:24:18PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 24, 2024 at 11:12:59AM +0000, Mark Rutland wrote:
> > Currently our IO accessors all use register addressing without offsets,
> > but we could safely use offset addressing (without writeback) to
> > simplify and optimize the generated code.
>
> > Aside from the better code generation, there should be no functional
> > change as a result of this patch. I have lightly tested this patch,
> > including booting under KVM (where some devices such as PL011 are
> > emulated).
>
> Reviewed-by: Jason Gunthorpe <jgg at nvidia.com>
>
> FWIW I have had 0-day compile test a very similar patch with no
> compilation problems.
>
> I also got similar code savings results.
Thanks; much appreciated!
> However, I noticed that clang 17.0.6 at least does not have any
> benefit, I suppose it is a missing compiler feature.
Hmm; yes, clang doesn't seem to handle this well.
LLVM folk, any idea why clang can't make use of offset addressing for the
inline assembly in the original patch?
https://lore.kernel.org/linux-arm-kernel/20240124111259.874975-1-mark.rutland@arm.com/
I've written a self-contained test case below to demonstrate; I can't convince
clang to generate offset addressing modes whereas gcc will, meaning gcc
generates much nicer code. I'm not sure if this is just something clang can't
currently do or if we're tickling some edge case.
I also note clang isn't using XZR directly for the value, despite the "rZ"
constraint; I'm not sure if I'm doing something wrong here.
| [mark at lakrids:~/src/linux]% cat asm-test.c
| #define __always_inline inline __attribute__((always_inline))
| #define u64 unsigned long
|
| static __always_inline void writeq(u64 val, volatile u64 *addr)
| {
| asm volatile(
| " str %x[val], %[addr]\n"
| :
| : [addr] "Qo" (*addr),
| [val] "rZ" (val)
| );
| }
|
| void writeq_zero_8_times(volatile void *addr)
| {
| writeq(0, addr + 8 * 0);
| writeq(0, addr + 8 * 1);
| writeq(0, addr + 8 * 2);
| writeq(0, addr + 8 * 3);
| writeq(0, addr + 8 * 4);
| writeq(0, addr + 8 * 5);
| writeq(0, addr + 8 * 6);
| writeq(0, addr + 8 * 7);
| }
| [mark at lakrids:~/src/linux]% /mnt/data/opt/toolchain/kernel-org-llvm/17.0.6/llvm-17.0.6-x86_64/bin/clang --target=aarch64-linux -c asm-test.c -O2
| [mark at lakrids:~/src/linux]% usekorg 13.2.0 aarch64-linux-objdump -d asm-test.o
|
| asm-test.o: file format elf64-littleaarch64
|
|
| Disassembly of section .text:
|
| 0000000000000000 <writeq_zero_8_times>:
| 0: 91002009 add x9, x0, #0x8
| 4: aa1f03e8 mov x8, xzr
| 8: 9100400a add x10, x0, #0x10
| c: 9100600b add x11, x0, #0x18
| 10: 9100800c add x12, x0, #0x20
| 14: 9100a00d add x13, x0, #0x28
| 18: f9000008 str x8, [x0]
| 1c: f9000128 str x8, [x9]
| 20: 9100c009 add x9, x0, #0x30
| 24: 9100e00e add x14, x0, #0x38
| 28: f9000148 str x8, [x10]
| 2c: f9000168 str x8, [x11]
| 30: f9000188 str x8, [x12]
| 34: f90001a8 str x8, [x13]
| 38: f9000128 str x8, [x9]
| 3c: f90001c8 str x8, [x14]
| 40: d65f03c0 ret
| [mark at lakrids:~/src/linux]% usekorg 13.2.0 aarch64-linux-gcc -c asm-test.c -O2
| [mark at lakrids:~/src/linux]% usekorg 13.2.0 aarch64-linux-objdump -d asm-test.o
|
| asm-test.o: file format elf64-littleaarch64
|
|
| Disassembly of section .text:
|
| 0000000000000000 <writeq_zero_8_times>:
| 0: f900001f str xzr, [x0]
| 4: f900041f str xzr, [x0, #8]
| 8: f900081f str xzr, [x0, #16]
| c: f9000c1f str xzr, [x0, #24]
| 10: f900101f str xzr, [x0, #32]
| 14: f900141f str xzr, [x0, #40]
| 18: f900181f str xzr, [x0, #48]
| 1c: f9001c1f str xzr, [x0, #56]
| 20: d65f03c0 ret
In case this was an edge case I tried a few variations, which didn't seem to
help:
* Using different constraints (I tried input constraints {"Qo", "o", "m"} and
  output constraints {"=Qo", "=o", "=m"})
* Removing "volatile" in case that was getting in the way of some optimization
* Turning writeq() into a macro in case this was a problem with inlining
> Finally, in my experiments with the WC issue it wasn't entirely
> helpful, I could not get both gcc and clang to always generate
> load-free store sequences for 64 bytes even with this.
Fair enough, thanks for having a go.
FWIW, for the WC case I'd be happy to have inline asm for 8 back-to-back STRs,
if that would help? i.e. like before but with 8 STRs instead of 4 STPs.
Mark.