[PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

Ard Biesheuvel ardb at kernel.org
Thu Nov 5 02:51:08 EST 2020


On Thu, 5 Nov 2020 at 03:50, Li Qiang <liqiang64 at huawei.com> wrote:
>
> Hi Eric,
>
> On 2020/11/5 1:57, Eric Biggers wrote:
> > On Tue, Nov 03, 2020 at 08:15:06PM +0800, l00374334 wrote:
> >> From: liqiang <liqiang64 at huawei.com>
> >>
> >>      In the libz library, the Adler32 checksum is usually a significant
> >>      hot spot, and the SVE instruction set can accelerate it easily,
> >>      which noticeably improves the performance of the libz library.
> >>
> >>      We divide buf into blocks according to the SVE vector width, and
> >>      then use vector registers to process the data one block at a time,
> >>      which achieves the desired acceleration.
> >>
> >>      On machines that support arm64 SVE instructions, this algorithm is
> >>      about 3~4 times faster than the C implementation in libz. The wider
> >>      the SVE registers, the greater the speedup.
> >>
> >>      Measured on a Taishan 1951 machine that supports 256-bit wide SVE,
> >>      below are the results for random data of 1M and 10M bytes:
> >>
> >>              [root at xxx adler32]# ./benchmark 1000000
> >>              Libz alg: Time used:    608 us, 1644.7 Mb/s.
> >>              SVE  alg: Time used:    166 us, 6024.1 Mb/s.
> >>
> >>              [root at xxx adler32]# ./benchmark 10000000
> >>              Libz alg: Time used:   6484 us, 1542.3 Mb/s.
> >>              SVE  alg: Time used:   2034 us, 4916.4 Mb/s.
> >>
> >>      The blocks can be of any size, so the algorithm adapts automatically
> >>      to SVE hardware of different vector widths without code changes.
> >>
> >>
> >> Signed-off-by: liqiang <liqiang64 at huawei.com>
> >
> > Note that this patch does nothing to actually wire up the kernel's copy of libz
> > (lib/zlib_{deflate,inflate}/) to use this implementation of Adler32.  To do so,
> > libz would either need to be changed to use the shash API, or you'd need to
> > implement an adler32() function in lib/crypto/ that automatically uses an
> > accelerated implementation if available, and make libz call it.
> >
> > Also, in either case a C implementation would be required too.  There can't be
> > just an architecture-specific implementation.
>
> Okay, thank you for pointing out these problems and for your suggestions.
> I will continue to improve my code.
>
> >
> > Also as others have pointed out, there's probably not much point in having a SVE
> > implementation of Adler32 when there isn't even a NEON implementation yet.  It's
> > not too hard to implement Adler32 using NEON, and there are already several
> > permissively-licensed NEON implementations out there that could be used as a
> > reference, e.g. my implementation using NEON intrinsics here:
> > https://github.com/ebiggers/libdeflate/blob/v1.6/lib/arm/adler32_impl.h
> >
> > - Eric
> > .
> >
>
> I am very happy to get this NEON implementation code. :)
>

Note that NEON intrinsics can be compiled for 32-bit ARM as well (with
a bit of care - please refer to lib/raid6/recov_neon_inner.c for an
example of how to deal with intrinsics that are only available on
arm64), and they are less error prone, so intrinsics should be
preferred where feasible.
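As a starting point for the portable C implementation that would be
required alongside any architecture-specific one, here is a minimal
sketch of the standard approach zlib itself takes (deferring the
expensive modulo across a bounded block). The function name and
constants below are illustrative, not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u   /* largest prime smaller than 2^16 */
#define BLOCK_MAX 5552u    /* max bytes before s2 could overflow uint32_t */

/* Scalar Adler-32. Within each block, s1 and s2 only accumulate;
 * the modulo is taken once per block instead of once per byte. */
static uint32_t adler32_scalar(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;

    while (len > 0) {
        size_t n = len < BLOCK_MAX ? len : BLOCK_MAX;

        len -= n;
        while (n--) {
            s1 += *buf++;   /* s1 = sum of bytes */
            s2 += s1;       /* s2 = sum of running s1 values */
        }
        s1 %= ADLER_MOD;
        s2 %= ADLER_MOD;
    }
    return (s2 << 16) | s1;
}
```

The BLOCK_MAX bound of 5552 is the largest block length for which the
unreduced sums are guaranteed to fit in a 32-bit accumulator, which is
why the modulo can be hoisted out of the inner loop.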

However, you have still not explained how optimizing Adler32 makes a
difference for a real-world use case. Where is libdeflate used on a
hot path?
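For reference, the property that lets the buffer be split into blocks
of arbitrary width (which both a NEON and an SVE version would
exploit) can be written down in scalar C. This is only an illustration
of the decomposition, not code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u

/* Per-block decomposition used by vector implementations: instead of
 * updating s2 after every byte, compute for a whole block of n bytes
 *   s1' = s1 + S            where S = sum(buf[i])
 *   s2' = s2 + n*s1 + W     where W = sum((n - i) * buf[i])
 * S and W are exactly the horizontal sums a vector unit produces. */
static void adler32_block(uint32_t *s1, uint32_t *s2,
                          const uint8_t *buf, size_t n)
{
    uint32_t sum = 0, weighted = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += buf[i];
        weighted += (uint32_t)(n - i) * buf[i];
    }
    /* Use the old s1 in the s2 update, then advance s1. */
    *s2 = (*s2 + (uint32_t)n * *s1 + weighted) % ADLER_MOD;
    *s1 = (*s1 + sum) % ADLER_MOD;
}
```

In a vector implementation, S comes from a plain horizontal add and W
from a multiply-accumulate against the descending weights n, n-1, ...,
1; since the identity holds for any n, the block size can simply be
the hardware vector width, which is what makes the approach
vector-length agnostic.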



More information about the linux-arm-kernel mailing list