[PATCH v4] arm64: lib: accelerate do_csum

Robin Murphy robin.murphy at arm.com
Tue Jan 14 04:18:46 PST 2020


On 2020-01-11 8:09 am, Shaokun Zhang wrote:
> +Cc Yuke Zhang, who has used this patch and benefited from its gains while
> debugging a performance issue.
> 
> Hi Will,
> 
> Thanks for reactivating this thread.
> Robin, any comments are welcome, and hopefully it can be merged into mainline.

OK, I had a play with this yesterday, and somewhat surprisingly even 
with a recent GCC it results in utterly dreadful code. I would always 
have expected the head/tail alignment in __uint128_t arithmetic to be 
ugly, and it certainly is, but even the "*ptr++" load in the main loop 
comes out as this delightful nugget:

      e3c:       f8410502        ldr     x2, [x8], #16
      e40:       f85f8103        ldur    x3, [x8, #-8]

(Clang does at least manage to emit a post-indexed LDP there, but the 
rest remains pretty awful)
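
For anyone who wants to reproduce this without digging out the patch, the
construct in question is essentially just a plain __uint128_t dereference in
the accumulation loop. A minimal standalone sketch (not the patch itself -
the name and the omitted head/tail handling are mine) looks roughly like:

/*
 * Minimal sketch of a __uint128_t-strided accumulation loop, only to
 * illustrate the "*ptr++" load discussed above; not the code from the
 * patch, and the head/tail alignment and the final fold down to 16 bits
 * are deliberately omitted.
 */
static unsigned int sketch_csum(const void *buf, int len)
{
	const __uint128_t *ptr = buf;
	__uint128_t sum = 0, tmp;

	while (len >= 16) {
		tmp = *ptr++;		/* GCC tends to split this into LDR + LDUR */
		sum += tmp;
		sum += (sum < tmp);	/* fold the carry back in */
		len -= 16;
	}
	return (unsigned int)sum;	/* placeholder: real code folds 128 -> 16 bits */
}

Building something like this with GCC at -O2 for aarch64 should show the
same LDR/LDUR pair for the load, whereas Clang gets a post-indexed LDP.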

Overall it ends up noticeably slower than even the generic code for 
small buffers. I rigged up a crude userspace test to run the numbers 
below - data is average call time in nanoseconds; "new" is the routine 
from this patch, "new2/3/4" are loop-tuning variations of what I 
came up with when I then went back to my WIP branch and finished off my 
original idea. Once I've confirmed I got big-endian right I'll send out 
another patch :)
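
For context, the harness itself was nothing more sophisticated than
something along these lines (a rough sketch rather than the actual test
program; the stand-in do_csum() below is just a trivial 16-bit fold so
the sketch builds on its own, whereas the real runs plugged in the
kernel's generic routine and the variants above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Trivial stand-in so the sketch is self-contained. */
static unsigned int do_csum(const unsigned char *buf, int len)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i + 1 < len; i += 2)
		sum += buf[i] | (buf[i + 1] << 8);
	if (len & 1)
		sum += buf[len - 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

static long long bench(const unsigned char *buf, int len, int iters)
{
	struct timespec t0, t1;
	volatile unsigned int sink;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		sink = do_csum(buf, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sink;

	/* average call time in nanoseconds */
	return ((t1.tv_sec - t0.tv_sec) * 1000000000LL +
		(t1.tv_nsec - t0.tv_nsec)) / iters;
}

int main(void)
{
	static const int sizes[] = { 3, 8, 15, 48, 64, 256, 4096, 1048576 };
	unsigned char *buf = malloc(1048576);
	unsigned int i;

	memset(buf, 0xa5, 1048576);
	/* iteration counts would be scaled per size in practice */
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%9d: %lld\n", sizes[i], bench(buf, sizes[i], 10000));
	free(buf);
	return 0;
}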

Robin.


GCC 9.2.0:
----------
Cortex-A53
size            generic new     new2    new3    new4
        3:       20      35      22      22      24
        8:       34      35      22      22      24
       15:       36      35      29      23      25
       48:       69      45      38      38      39
       64:       80      50      49      44      44
      256:       217     117     99      110     92
     4096:       2908    1310    1146    1269    983
  1048576:       860430  461694  461694  493173  451201
Cortex-A72
size            generic new     new2    new3    new4
        3:       8       21      10      9       10
        8:       20      21      10      9       10
       15:       16      21      12      11      11
       48:       29      29      18      19      20
       64:       35      30      24      21      23
      256:       125     66      48      46      46
     4096:       1720    778     532     573     450
  1048576:       472187  272819  188874  220354  167888

Clang 9.0.1:
------------
Cortex-A53
size            generic new     new2    new3    new4
        3:       21      29      21      21      21
        8:       33      29      21      21      21
       15:       35      28      24      23      23
       48:       73      39      36      37      38
       64:       85      44      46      42      44
      256:       220     110     107     107     89
     4096:       2949    1310    1187    1310    942
  1048576:       849937  451201  472187  482680  451201
Cortex-A72
size            generic new     new2    new3    new4
        3:       8       16      10      10      10
        8:       23      16      10      10      10
       15:       16      16      12      12      12
       48:       27      21      18      20      20
       64:       31      24      24      22      23
      256:       125     53      48      63      46
     4096:       1720    655     573     860     532
  1048576:       472187  230847  209861  272819  188874

> 
> Thanks,
> Shaokun
> 
> On 2020/1/9 1:20, Will Deacon wrote:
>> On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
>>> From: Lingyan Huang <huanglingyan2 at huawei.com>
>>>
>>> Function do_csum() in lib/checksum.c is used to compute the checksum,
>>> which turns out to be slow and costs a lot of resources.
>>> Let's accelerate the checksum computation for arm64.
>>>
>>> We tested its performance on the Huawei Kunpeng 920 SoC, as follows:
>>>    size   general(ns)  csum_128(ns)  csum_64(ns)
>>>     64B:          160            80           50
>>>    256B:          120            70           60
>>>   1023B:          350           140          150
>>>   1024B:          350           130          140
>>>   1500B:          470           170          180
>>>   2048B:          630           210          240
>>>   4095B:         1220           390          430
>>>   4096B:         1230           390          430
>>>
>>> Cc: Will Deacon <will at kernel.org>
>>> Cc: Robin Murphy <robin.murphy at arm.com>
>>> Cc: Catalin Marinas <catalin.marinas at arm.com>
>>> Cc: Ard Biesheuvel <ard.biesheuvel at linaro.org>
>>> Originally-from: Robin Murphy <robin.murphy at arm.com>
>>> Signed-off-by: Lingyan Huang <huanglingyan2 at huawei.com>
>>> Signed-off-by: Shaokun Zhang <zhangshaokun at hisilicon.com>
>>> ---
>>> Hi,
>>> Apologies for posting this version so late; we wanted to optimise it
>>> further. Lingyan tested its performance, which is attached in the
>>> commit log. Both (the 128-bit and 64-bit versions) are much better
>>> than the initial code.
>>> ChangeLog:
>>>      based on Robin's code, with the stride changed from 64 to 128.
>>>
>>>   arch/arm64/include/asm/checksum.h |  3 ++
>>>   arch/arm64/lib/Makefile           |  2 +-
>>>   arch/arm64/lib/csum.c             | 81 +++++++++++++++++++++++++++++++++++++++
>>>   3 files changed, 85 insertions(+), 1 deletion(-)
>>>   create mode 100644 arch/arm64/lib/csum.c
>>
>> Robin -- any chance you could look at this please? If it's based on your
>> code then hopefully it's straightforward to review ;)
>>
>> Will
>>
>> .
>>
> 


