Slowdown copying data between kernel versions 4.19 and 5.15

Mark Rutland mark.rutland at arm.com
Thu Jun 29 07:24:33 PDT 2023


On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> >After some investigation I am guessing the issue is either in the iovector
> >iteration changes (around
> >https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
> >lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
> >of my depth so it is just speculation. 
> 
> After comparing the disassembly of __arch_copy_from_user on both kernels and
> going through commit logs, I figured out the slowdown was mostly due to
> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifically
> the changes to uao_ldp.

For the benefit of others, that's commit:

  fc703d80130b1c9d ("arm64: uaccess: split user/kernel routine")

> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index 2c26ca5b7bb0..2b5454fa0f24 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>  #endif
>  
>  /*
> - * Generate the assembly for UAO alternatives with exception table entries.
> + * Generate the assembly for LDTR/STTR with exception table entries.
>   * This is complicated as there is no post-increment or pair versions of the
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
> -#ifdef CONFIG_ARM64_UAO
>         .macro uao_ldp l, reg1, reg2, addr, post_inc
> -               alternative_if_not ARM64_HAS_UAO
> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
> -8889:                  nop;
> -                       nop;
> -               alternative_else
> -                       ldtr    \reg1, [\addr];
> -                       ldtr    \reg2, [\addr, #8];
> -                       add     \addr, \addr, \post_inc;
> -               alternative_endif
> +8888:          ldtr    \reg1, [\addr];
> +8889:          ldtr    \reg2, [\addr, #8];
> +               add     \addr, \addr, \post_inc;
>  
>                 _asm_extable    8888b,\l;
>                 _asm_extable    8889b,\l;
>         .endm
> 
> I could not directly revert the changes to test since more names changed in
> other commits than I cared to figure out, but I hacked out that change, and
> saw that the performance of the test program was basically back to normal. 
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
>         .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:          ldtr    \reg1, [\addr];
> -8889:          ldtr    \reg2, [\addr, #8];
> -               add     \addr, \addr, \post_inc;
> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:          nop;
> +               nop;

As Catalin noted, we can't make that change generally as it'd be broken for any
system with PAN, and in general we *really* want to use LDTR/STTR for user
accesses to catch any misuse with kernel pointers.
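
For anyone following along, the register-level effect of the two sequences in
the macro can be modelled in plain C. This is only an illustrative sketch: the
function names are made up, and the model captures just the resulting register
values and pointer. It does not capture the permission check (LDTR performs the
access with EL0 permissions, which is why it keeps working with PAN and catches
kernel pointers) or the access pattern (a single pair load versus two separate
8-byte loads), and that second part is what matters for the timing discussed
here.

```c
#include <stdint.h>
#include <string.h>

/* Model of: ldp reg1, reg2, [addr], #post_inc  (one pair load, post-indexed) */
void model_ldp(uint64_t *reg1, uint64_t *reg2,
               const unsigned char **addr, size_t post_inc)
{
    uint64_t pair[2];

    memcpy(pair, *addr, sizeof(pair));  /* both registers from one 16-byte read */
    *reg1 = pair[0];
    *reg2 = pair[1];
    *addr += post_inc;                  /* post-index writeback */
}

/* Model of: ldtr reg1, [addr]; ldtr reg2, [addr, #8]; add addr, addr, #post_inc */
void model_ldtr_pair(uint64_t *reg1, uint64_t *reg2,
                     const unsigned char **addr, size_t post_inc)
{
    memcpy(reg1, *addr, sizeof(*reg1));      /* first 8-byte load */
    memcpy(reg2, *addr + 8, sizeof(*reg2));  /* second 8-byte load at offset 8 */
    *addr += post_inc;                       /* separate add replaces writeback */
}

/* Returns 1 when both models leave identical registers and pointer. */
int models_agree(void)
{
    unsigned char buf[32];
    const unsigned char *pa = buf, *pb = buf;
    uint64_t a1, a2, b1, b2;
    size_t i;

    for (i = 0; i < sizeof(buf); i++)
        buf[i] = (unsigned char)(i * 7 + 1);

    model_ldp(&a1, &a2, &pa, 16);
    model_ldtr_pair(&b1, &b2, &pb, 16);

    return a1 == b1 && a2 == b2 && pa == pb;
}
```

Both models produce the same values, so the difference between the two kernels
is in how the loads are issued and checked, not in what they load.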

> Profiling with the hacked __arch_copy_from_user 
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> 
>  Performance counter stats for '/mnt/usrroot/test_copy':
> 
>           11822342      instructions              #    0.23  insn per cycle         
>           50689594      cycles                                                      
>           37627922      ld_dep_stall                                                
>              17933      read_alloc                                                  
>               3421      dTLB-load-misses                                            
> 
>        0.043440253 seconds time elapsed
> 
>        0.004382000 seconds user
>        0.039442000 seconds sys
> 
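
The derived figures can be recomputed from the raw counts in the perf stat
output quoted above. A small sketch in C follows; the helper names are made up,
and treating ld_dep_stall as a count of stalled cycles is an assumption that
depends on how the PMU on this part defines the event.

```c
#include <stdio.h>

/* Counts copied from the perf stat output quoted above. */
const unsigned long long instructions = 11822342ULL;
const unsigned long long cycles       = 50689594ULL;
const unsigned long long ld_dep_stall = 37627922ULL;

/* Ratio of two event counts; returns 0.0 when the denominator is zero. */
double ratio(unsigned long long num, unsigned long long den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Print instructions per cycle and the share of cycles counted as load stalls. */
void print_derived(void)
{
    printf("insn per cycle:       %.2f\n", ratio(instructions, cycles));
    printf("ld_dep_stall / cycle: %.2f\n", ratio(ld_dep_stall, cycles));
}
```

The first ratio reproduces the "insn per cycle" figure perf prints. Under the
assumption above, the second suggests that most of the cycles in this run were
spent waiting on loads even with the hacked routine.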
> Unfortunately the hack crashes in other cases so it is not a viable solution
> for us. Also, on our actual workload there is still a small difference in
> performance remaining that I have not tracked down yet (I am guessing it has
> to do with the dTLB-load-misses remaining higher). 
> 
> Note, I think that the slowdown is only noticeable in cases like ours where
> the data being copied from is not in cache (for us, because the FPGA writes
> it).

When you say "is not in cache", what exactly do you mean? If this were just the
latency of filling a cache, I wouldn't expect the size of the first access to
make a difference, so I'm assuming the source buffer is not mapped with
cacheable memory attributes (cacheable being what we generally assume).

Which memory attributes are the source and destination buffers mapped with? Is
that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?

I'm assuming this is with some out-of-tree driver; if that's in a public tree
could you please provide a pointer to it?

Thanks,
Mark.


