[PATCH v5] arm64: Implement optimised checksum routine
zhangshaokun at hisilicon.com
Thu Jan 16 05:59:30 PST 2020
On 2020/1/16 18:55, Will Deacon wrote:
> On Wed, Jan 15, 2020 at 04:42:39PM +0000, Robin Murphy wrote:
>> Apparently there exist certain workloads which rely heavily on software
>> checksumming, for which the generic do_csum() implementation becomes a
>> significant bottleneck. Therefore let's give arm64 its own optimised
>> version - for ease of maintenance this foregoes assembly or intrinsics,
>> and is thus not actually arm64-specific, but does rely heavily on C
>> idioms that translate well to the A64 ISA and the typical load/store
>> capabilities of most ARMv8 CPU cores.
>> The resulting increase in checksum throughput scales nicely with buffer
>> size, tending towards 4x for a small in-order core (Cortex-A53), and up
>> to 6x or more for an aggressive big core (Ampere eMAG).
>> Signed-off-by: Robin Murphy <robin.murphy at arm.com>
>> I rigged up a simple userspace test to run the generic and new code for
>> various buffer lengths at aligned and unaligned offsets; data is average
>> runtime in nanoseconds.
> Shaokun, Yuke -- please can you give this a spin and let us know how it
> works for you? If it looks good, then I can queue it up today/tomorrow.
Lingyan has tested this patch; the results are as follows:
1000 loops   generic (ns)  csum_hly_128B.c (ns)  csum_robin_v5.s (ns)
64B:                48510                 40730                37440
256B:              104180                 59330                50210
1023B:             328580                124600                89960
1024B:             327880                125300                88520
1500B:             466440                165090               113560
2048B:             632060                212470               158320
4095B:            1219850                393080               263940
4096B:            1222740                399200               262550
This is better than Lingyan's patch v4. Thanks for your work, Robin.
If you are happy with the results, please feel free to add:
Reported-by: Lingyan Huang <huanglingyan2 at huawei.com>
Tested-by: Lingyan Huang <huanglingyan2 at huawei.com>
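For reference, the general shape of the approach Robin's commit message describes - portable C that accumulates the RFC 1071 ones'-complement sum through a wide 64-bit accumulator, which translates well to A64 loads and adds - can be sketched as below. This is a hypothetical illustration, not the actual patch: the real kernel routine additionally unrolls the main loop and handles alignment explicitly to reach the quoted speedups, and this sketch assumes a little-endian target (as arm64 Linux is).

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch (NOT the actual patch): a do_csum-style routine
 * computing the RFC 1071 ones'-complement sum with a 64-bit accumulator.
 * Little-endian byte order assumed throughout.
 */
static uint16_t do_csum_sketch(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	uint64_t v;

	/* Main loop: consume 8 bytes per iteration. */
	while (len >= 8) {
		memcpy(&v, buf, 8);	/* memcpy keeps unaligned loads legal C */
		sum += v;
		if (sum < v)		/* 64-bit overflow: wrap the carry around */
			sum++;
		buf += 8;
		len -= 8;
	}

	/* Remaining tail, two bytes at a time. */
	while (len >= 2) {
		uint16_t w;
		memcpy(&w, buf, 2);
		sum += w;
		if (sum < w)
			sum++;
		buf += 2;
		len -= 2;
	}

	/* Final odd byte sits in the low half of its 16-bit word. */
	if (len) {
		sum += buf[0];
		if (sum < buf[0])
			sum++;
	}

	/* Fold 64 -> 32 -> 16 bits, propagating carries at each step. */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Because ones'-complement addition is associative, the wide accumulation with end-around carry folds down to the same 16-bit result as a plain 16-bit word sum, which is what makes this kind of widening optimisation safe.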