Slowdown copying data between kernel versions 4.19 and 5.15

Robin Murphy robin.murphy at arm.com
Thu Jun 29 06:09:47 PDT 2023


On 2023-06-28 22:38, Havens, Austin wrote:
> 
> <Friday, June 23, 2023 14:30 PDT, Austin Havens (Anritsu)>
>> Hi all,
>> In the process of updating our kernel from 4.19 to 5.15 we noticed a slowdown when copying data. We are using ZynqMP 9EG SoCs and are basically following the Xilinx/AMD release branches (though a bit behind). I did some sample-based profiling with perf, and it showed that a lot of the time was spent in __arch_copy_from_user; since the amount of data being copied is the same, it seems like more time is being spent in each __arch_copy_from_user call.
>>
>> I made a test program to replicate the issue and here is what I see (I used the same binary on both versions to rule out differences from the compiler).
>>
>> root at smudge:/tmp# uname -a
>> Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
>> root at smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>>
>>   Performance counter stats for '/mnt/usrroot/test_copy':
>>
>>            13202623      instructions              #    0.25  insn per cycle
>>            52947780      cycles
>>            37588761      ld_dep_stall
>>               16301      read_alloc
>>                1660      dTLB-load-misses
>>
>>         0.044990363 seconds time elapsed
>>
>>         0.004092000 seconds user
>>         0.040920000 seconds sys
>>
>> root at ahraptor:/tmp# uname -a
>> Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
>> root at ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>>
>>   Performance counter stats for '/mnt/usrroot/test_copy':
>>
>>           11625888      instructions              #    0.14  insn per cycle
>>            83135040      cycles
>>            69833562      ld_dep_stall
>>               27948      read_alloc
>>                3367      dTLB-load-misses
>>
>>         0.070537894 seconds time elapsed
>>
>>         0.004165000 seconds user
>>         0.066643000 seconds sys
>>
> 
>> After some investigation I am guessing the issue is either in the iovector iteration changes (around https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out of my depth so it is just speculation.
>>
> 
> After comparing the disassembly of __arch_copy_from_user on both kernels and going through commit logs, I figured out the slowdown was mostly due to the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifically the changes to uao_ldp.
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index 2c26ca5b7bb0..2b5454fa0f24 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>   #endif
>   
>   /*
> - * Generate the assembly for UAO alternatives with exception table entries.
> + * Generate the assembly for LDTR/STTR with exception table entries.
>    * This is complicated as there is no post-increment or pair versions of the
>    * unprivileged instructions, and USER() only works for single instructions.
>    */
> -#ifdef CONFIG_ARM64_UAO
>          .macro uao_ldp l, reg1, reg2, addr, post_inc
> -               alternative_if_not ARM64_HAS_UAO
> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
> -8889:                  nop;
> -                       nop;
> -               alternative_else
> -                       ldtr    \reg1, [\addr];
> -                       ldtr    \reg2, [\addr, #8];
> -                       add     \addr, \addr, \post_inc;
> -               alternative_endif
> +8888:          ldtr    \reg1, [\addr];
> +8889:          ldtr    \reg2, [\addr, #8];
> +               add     \addr, \addr, \post_inc;
>   
>                  _asm_extable    8888b,\l;
>                  _asm_extable    8889b,\l;
>          .endm
> 
> I could not directly revert the changes to test, since more names had changed in other commits than I cared to figure out, but I hacked out that change and saw that the performance of the test program was basically back to normal.
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>    * unprivileged instructions, and USER() only works for single instructions.
>    */
>          .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:          ldtr    \reg1, [\addr];
> -8889:          ldtr    \reg2, [\addr, #8];
> -               add     \addr, \addr, \post_inc;
> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:          nop;
> +               nop;
> 
> 
> Profiling with the hacked __arch_copy_from_user
> root at ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> 
>   Performance counter stats for '/mnt/usrroot/test_copy':
> 
>            11822342      instructions              #    0.23  insn per cycle
>            50689594      cycles
>            37627922      ld_dep_stall
>               17933      read_alloc
>                3421      dTLB-load-misses
> 
>         0.043440253 seconds time elapsed
> 
>         0.004382000 seconds user
>         0.039442000 seconds sys
> 
> Unfortunately the hack crashes in other cases, so it is not a viable solution for us. Also, on our actual workload there is still a small difference in performance remaining that I have not tracked down yet (I am guessing it has to do with the dTLB-load-misses remaining higher).
> 
> Note, I think the slowdown is only noticeable in cases like ours where the data being copied from is not in the cache (for us, because the FPGA writes it).
> 
> If anyone knows if this slowdown is expected, or if there are any workarounds, that would be helpful.

Some slowdown is to be expected: the removal of set_fs() in 5.11 
means we always use the unprivileged load/store instructions for 
userspace access. There are no register-pair forms of LDTR/STTR, so 
the uaccess routines are inherently less efficient than a regular 
memcpy() could be (the doubling seen in those perf event counts 
reflects the fact that the CPU is literally issuing twice as many 
load ops), but the trade-off is that they are now significantly more 
robust and harder to exploit.

There is nominally a bit of performance still left on the table for 
smaller copies below 1-2KB, and particularly below 128 bytes or so, 
due to the current compromises made for fault handling. I had a go at 
improving this a while back with [1], but that series got parked 
since my copy_to_user() has an insidious bug which we couldn't easily 
track down, and frankly all the fixup shenanigans are something of a 
deranged nightmare that even I can no longer make sense of. Mark then 
tried a more rigorous approach, attempting to nail down the API 
semantics with tests first [2], but that ended up just leaving more 
open questions that nobody has found the time to think about further.

Thanks,
Robin.

[1] https://lore.kernel.org/r/cover.1664363162.git.robin.murphy@arm.com/
[2] https://lore.kernel.org/r/20230321122514.1743889-1-mark.rutland@arm.com/



More information about the linux-arm-kernel mailing list