[PATCH 0/1] riscv: better network performance with memcpy, uaccess
palmer at dabbelt.com
Fri Jun 4 09:19:53 PDT 2021
On Fri, 04 Jun 2021 02:53:33 PDT (-0700), akira.tsukamoto at gmail.com wrote:
> I am adding a cover letter to explain the history and details, since the
> improvement is a combination of my patch with Gary's memcpy patch.
> Comparison of iperf3 benchmark results by applying Gary's memcpy patch and
> my uaccess optimization patch. All results are from the same base kernel,
> same rootfs and same BeagleV beta board.
> First column:  BeagleV 5.13-rc4 kernel
> Second column: Added Palmer's memcpy in C + my uaccess patch
> Third column:  Added Gary's memcpy + my uaccess patch
> --- TCP recv ---
> 686 Mbits/sec | 700 Mbits/sec | 904 Mbits/sec
> 683 Mbits/sec | 701 Mbits/sec | 898 Mbits/sec
> 695 Mbits/sec | 702 Mbits/sec | 905 Mbits/sec
> --- TCP send ---
> 383 Mbits/sec | 390 Mbits/sec | 393 Mbits/sec
> 384 Mbits/sec | 393 Mbits/sec | 392 Mbits/sec
> --- UDP send ---
> 307 Mbits/sec | 358 Mbits/sec | 402 Mbits/sec
> 307 Mbits/sec | 359 Mbits/sec | 402 Mbits/sec
> --- UDP recv ---
> 630 Mbits/sec | 799 Mbits/sec | 875 Mbits/sec
> 730 Mbits/sec | 796 Mbits/sec | 873 Mbits/sec
> The uaccess patch reduces pipeline stalls from read-after-write (RAW)
> hazards by unrolling the loads and stores.
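A rough C sketch of the unrolling idea described above (not the actual uaccess.S, which is RISC-V assembly and also handles page faults): issue a batch of loads before the corresponding stores, so that on an in-order core no store has to wait on the load issued immediately before it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: copy 8 words per iteration, loads first,
 * stores second, to reduce load-use (RAW) stalls on in-order pipelines. */
static void copy_unrolled(uint64_t *dst, const uint64_t *src, size_t words)
{
    size_t i = 0;
    for (; i + 8 <= words; i += 8) {
        /* Batch the loads first ... */
        uint64_t a = src[i],     b = src[i + 1];
        uint64_t c = src[i + 2], d = src[i + 3];
        uint64_t e = src[i + 4], f = src[i + 5];
        uint64_t g = src[i + 6], h = src[i + 7];
        /* ... then the stores, so none waits on the load just issued. */
        dst[i]     = a; dst[i + 1] = b;
        dst[i + 2] = c; dst[i + 3] = d;
        dst[i + 4] = e; dst[i + 5] = f;
        dst[i + 6] = g; dst[i + 7] = h;
    }
    /* Word-at-a-time tail for the remainder. */
    for (; i < words; i++)
        dst[i] = src[i];
}
```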
> The main reason for using assembly inside uaccess.S is that
> __asm_to/copy_from_user() must handle page faults manually inside the
> functions.
> The above result is a combination of Gary's memcpy speeding things up
> by reducing the S-mode and M-mode switching, and my uaccess patch reducing
> pipeline stalls when user space uses syscalls with large data.
> We had a discussion with Palmer about improving network performance on the
> BeagleV beta board.
> Palmer suggested using C-based string routines, which check for unaligned
> addresses and use an 8-byte aligned copy when both src and dest are
> aligned, falling back to the current copy function otherwise.
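A minimal sketch of the C approach described above (the function name is hypothetical, not the routine Palmer proposed): copy 8 bytes at a time only when both pointers share 8-byte alignment, and fall back to a plain byte loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of an alignment-checking memcpy:
 * word copies when both src and dst are 8-byte aligned, byte
 * copies for the misaligned case and for the tail. */
static void *memcpy_aligned_fast(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Both pointers 8-byte aligned?  Then use 64-bit copies. */
    if ((((uintptr_t)d | (uintptr_t)s) & 7) == 0) {
        while (n >= 8) {
            *(uint64_t *)d = *(const uint64_t *)s;
            d += 8; s += 8; n -= 8;
        }
    }
    while (n--)  /* tail, or the whole copy if misaligned */
        *d++ = *s++;
    return dst;
}
```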
> Gary's assembly version of memcpy improves performance by never issuing
> unaligned accesses across a 64-bit boundary; instead it reads at aligned
> offsets and shifts the data into place, because every misaligned access
> traps and switches to OpenSBI in M-mode. The main speedup comes from
> avoiding the S-mode (kernel) and M-mode (OpenSBI) switching.
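The read-and-shift technique above can be sketched in C (a hypothetical helper, not Gary's actual assembly): to fetch 8 bytes starting at a misaligned address, perform only aligned 64-bit loads and splice the result from two neighbouring words, so no misaligned access (and hence no trap into M-mode) ever occurs.

```c
#include <stdint.h>

/* Hypothetical illustration: assemble an unaligned 64-bit value from
 * two aligned loads.  off_bytes must be 1..7 (bytes to skip in base[0]).
 * Assumes little-endian byte order, as on RISC-V. */
static uint64_t load_misaligned(const uint64_t *base, unsigned off_bytes)
{
    unsigned shift = off_bytes * 8;
    /* Low part comes from the end of base[0], high part from the
     * beginning of base[1]; only aligned loads are issued. */
    return (base[0] >> shift) | (base[1] << (64 - shift));
}
```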
> Processing network packets requires a lot of unaligned access for the
> packet header, and the design of the header format cannot be changed to
> make it aligned.
> And user applications pass large packet data with send/recv() and sendto/
> recvfrom() to reduce the number of function calls for reading and writing
> the data.
Makes sense. I'm still not opposed to moving to a C version, but it'd
need to be a fairly complicated one. I think having a fast C memcpy
would likely benefit a handful of architectures, as everything we're
talking about is an algorithmic improvement that can be expressed in C.
Given that the simple memcpy doesn't perform well for your workload, I'm
fine taking the assembly version.
>  https://lkml.org/lkml/2021/2/16/778
>  https://github.com/mcd500/linux-jh7100/tree/starlight-sdimproved
>  https://github.com/mcd500/linux-jh7100/tree/starlight-sd-palmer-string
>  https://github.com/mcd500/linux-jh7100/tree/starlight-sd-gary
> Akira Tsukamoto (1):
> riscv: prevent pipeline stall in __asm_to/copy_from_user
> arch/riscv/lib/uaccess.S | 106 +++++++++++++++++++++++++++------------
> 1 file changed, 73 insertions(+), 33 deletions(-)