Slowdown copying data between kernel versions 4.19 and 5.15
Mark Rutland
mark.rutland at arm.com
Thu Jun 29 07:24:33 PDT 2023
On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> >After some investigation I am guessing the issue is either in the iovector
> >iteration changes (around
> >https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
> >lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
> >of my depth so it is just speculation.
>
> After comparing the disassembly of __arch_copy_from_user on both kernels and
> going through commit logs, I figured out the slowdown was mostly due to
> the changes from commit fc703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifically
> the changes to uao_ldp.
For the benefit of others, that's commit:
fc703d80130b1c9d ("arm64: uaccess: split user/kernel routine")
>
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index 2c26ca5b7bb0..2b5454fa0f24 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -59,62 +59,32 @@ alternative_else_nop_endif
> #endif
>
> /*
> - * Generate the assembly for UAO alternatives with exception table entries.
> + * Generate the assembly for LDTR/STTR with exception table entries.
> * This is complicated as there is no post-increment or pair versions of the
> * unprivileged instructions, and USER() only works for single instructions.
> */
> -#ifdef CONFIG_ARM64_UAO
> .macro uao_ldp l, reg1, reg2, addr, post_inc
> - alternative_if_not ARM64_HAS_UAO
> -8888: ldp \reg1, \reg2, [\addr], \post_inc;
> -8889: nop;
> - nop;
> - alternative_else
> - ldtr \reg1, [\addr];
> - ldtr \reg2, [\addr, #8];
> - add \addr, \addr, \post_inc;
> - alternative_endif
> +8888: ldtr \reg1, [\addr];
> +8889: ldtr \reg2, [\addr, #8];
> + add \addr, \addr, \post_inc;
>
> _asm_extable 8888b,\l;
> _asm_extable 8889b,\l;
> .endm
>
> I could not directly revert the changes to test since more names changed in
> other commits than I cared to figure out, but I hacked out that change, and
> saw that the performance of the test program was basically back to normal.
>
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
> * unprivileged instructions, and USER() only works for single instructions.
> */
> .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888: ldtr \reg1, [\addr];
> -8889: ldtr \reg2, [\addr, #8];
> - add \addr, \addr, \post_inc;
> +8888: ldp \reg1, \reg2, [\addr], \post_inc;
> +8889: nop;
> + nop;
As Catalin noted, we can't make that change generally as it'd be broken for any
system with PAN, and in general we *really* want to use LDTR/STTR for user
accesses to catch any misuse with kernel pointers.
> Profiling with the hacked __arch_copy_from_user
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>
> Performance counter stats for '/mnt/usrroot/test_copy':
>
> 11822342 instructions # 0.23 insn per cycle
> 50689594 cycles
> 37627922 ld_dep_stall
> 17933 read_alloc
> 3421 dTLB-load-misses
>
> 0.043440253 seconds time elapsed
>
> 0.004382000 seconds user
> 0.039442000 seconds sys
>
> Unfortunately the hack crashes in other cases so it is not a viable solution
> for us. Also, on our actual workload there is still a small difference in
> performance remaining that I have not tracked down yet (I am guessing it has
> to do with the dTLB-load-misses remaining higher).
>
> Note, I think that the slow down is only noticeable in cases like ours where
> the data being copied from is not in cache (for us, because the FPGA writes
> it).
When you say "is not in cache", what exactly do you mean? If this were just the
latency of filling a cache, I wouldn't expect the size of the first access to
make a difference, so I'm assuming the source buffer is not mapped with
cacheable memory attributes, whereas we generally assume cacheable mappings.
Which memory attributes are the source and destination buffers mapped with? Is
that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
I'm assuming this is with some out-of-tree driver; if that's in a public tree
could you please provide a pointer to it?
Thanks,
Mark.