Slowdown copying data between kernel versions 4.19 and 5.15

Thu Jun 29 11:33:32 PDT 2023

On 2023-06-29 15:24, Mark Rutland wrote:
> On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
>>> After some investigation I am guessing the issue is either in the iovector
>>> iteration changes (around
>>> https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
>>> lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
>>> of my depth so it is just speculation.
>>
>> After comparing the disassembly of __arch_copy_from_user on both kernels and
>> going through commit logs, I figured out the slowdown was mostly due to
>> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifically
>> the changes to uao_ldp.
> 
> For the benefit of others, that's commit:
> 
>    fc703d80130b1c9d ("arm64: uaccess: split user/kernel routines")
> 
>>
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index 2c26ca5b7bb0..2b5454fa0f24 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>>   #endif
>>   
>>   /*
>> - * Generate the assembly for UAO alternatives with exception table entries.
>> + * Generate the assembly for LDTR/STTR with exception table entries.
>>    * This is complicated as there is no post-increment or pair versions of the
>>    * unprivileged instructions, and USER() only works for single instructions.
>>    */
>> -#ifdef CONFIG_ARM64_UAO
>>          .macro uao_ldp l, reg1, reg2, addr, post_inc
>> -               alternative_if_not ARM64_HAS_UAO
>> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
>> -8889:                  nop;
>> -                       nop;
>> -               alternative_else
>> -                       ldtr    \reg1, [\addr];
>> -                       ldtr    \reg2, [\addr, #8];
>> -                       add     \addr, \addr, \post_inc;
>> -               alternative_endif
>> +8888:          ldtr    \reg1, [\addr];
>> +8889:          ldtr    \reg2, [\addr, #8];
>> +               add     \addr, \addr, \post_inc;
>>   
>>                  _asm_extable    8888b,\l;
>>                  _asm_extable    8889b,\l;
>>          .endm
>>
>> I could not directly revert the changes to test since more names changed in
>> other commits than I cared to figure out, but I hacked out that change, and
>> saw that the performance of the test program was basically back to normal.
>>
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index ccedf548dac9..2ddf7eba46fd 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>>    * unprivileged instructions, and USER() only works for single instructions.
>>    */
>>          .macro user_ldp l, reg1, reg2, addr, post_inc
>> -8888:          ldtr    \reg1, [\addr];
>> -8889:          ldtr    \reg2, [\addr, #8];
>> -               add     \addr, \addr, \post_inc;
>> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
>> +8889:          nop;
>> +               nop;
> 
> As Catalin noted, we can't make that change generally as it'd be broken for any
> system with PAN, and in general we *really* want to use LDTR/STTR for user
> accesses to catch any misuse with kernel pointers.
> 
>> Profiling with the hacked __arch_copy_from_user
>> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>>
>>   Performance counter stats for '/mnt/usrroot/test_copy':
>>
>>            11822342      instructions              #    0.23  insn per cycle
>>            50689594      cycles
>>            37627922      ld_dep_stall
>>               17933      read_alloc
>>                3421      dTLB-load-misses
>>
>>         0.043440253 seconds time elapsed
>>
>>         0.004382000 seconds user
>>         0.039442000 seconds sys
>>
>> Unfortunately the hack crashes in other cases so it is not a viable solution
>> for us. Also, on our actual workload there is still a small difference in
>> performance remaining that I have not tracked down yet (I am guessing it has
>> to do with the dTLB-load-misses remaining higher).
>>
>> Note, I think that the slow down is only noticeable in cases like ours where
>> the data being copied from is not in cache (for us, because the FPGA writes
>> it).
> 
> When you say "is not in cache", what exactly do you mean? If this were just the
> latency of filling a cache I wouldn't expect the size of the first access to
> make a difference, so I'm assuming the source buffer is not mapped with
> cacheable memory attributes, which we generally assume.
> 
> Which memory attributes are the source and destination buffers mapped with? Is
> that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
> 
> I'm assuming this is with some out-of-tree driver; if that's in a public tree
> could you please provide a pointer to it?

Oh, re-reading the original mail, it looks like the copy is reading from
a userspace mapping of /dev/mem, so it'll be Device. The ~50%
performance drop did seem like more than I remember for Cortex-A53 in
similar benchmarking (usercopy vs. optimal memcpy), but that was all
cacheable, so doubling the number of actual memory system transactions
does seem like it could plausibly account for the extra difference.

Cheers,
Robin.