[PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.
Ard Biesheuvel
ardb at kernel.org
Wed Nov 4 03:04:49 EST 2020
On Wed, 4 Nov 2020 at 09:02, Li Qiang <liqiang64 at huawei.com> wrote:
>
> Hi Ard,
>
> Thank you very much for your reply and comments on the code :)
>
> On 2020/11/3 22:34, Ard Biesheuvel wrote:
> > (+ Dave)
> >
> > Hello liqiang,
> >
> > First of all, I don't think it is safe at the moment to use SVE in the
> > kernel, as we don't preserve all state IIRC. My memory is a bit hazy,
> > though, so perhaps Dave can elaborate?
>
> OK, I understand this problem now.
>
> >
> > I'll give my feedback on the code itself below, but please don't
> > consider this an ack on the intent of using SVE in the kernel.
> >
> > Could you explain why SVE is so much better than NEON here?
>
> In the past months of work, I spent some time researching ARM's SVE
> instruction set and tried to use SVE instructions to optimize adler32 and
> some other general-purpose algorithms, and achieved good results with
> Adler32.
>
> According to my current research and debugging (see the cover letter), I
> think the vectorized form of the Adler32 algorithm is very well suited to
> an SVE implementation: the data can be split into blocks of any length
> (16 bytes, 32 bytes, 128 bytes, 256 bytes, and so on), so the same code
> adapts to any SVE hardware with vectors from 128 to 2048 bits. Supporting
> different SVE vector widths needs no special casing; the code only has to
> pick the starting position in "Adler_sequence" according to the SVE vector
> width, and each implementation can then run at its full width.
>
> I am also trying to implement the algorithm with NEON instructions, and I
> will follow up as soon as I have results.
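
Just to make sure we are talking about the same maths: as far as I can
tell from the cover letter and the comments in the .S file, the per-block
update being vectorised is the standard Adler32 block recurrence. A rough
scalar model (plain C, not the SVE code itself; x stands in for the SVE
vector length in bytes, and adler32_block is just a name I made up):

  #include <linux/types.h>

  /*
   * One block of x bytes D1..Dx:
   *   A_n = A_{n-1} + D1 + D2 + ... + Dx
   *   B_n = B_{n-1} + x * A_{n-1} + x*D1 + (x-1)*D2 + ... + 1*Dx
   */
  static void adler32_block(u32 *a, u32 *b, const u8 *d, u32 x)
  {
          u32 i;

          *b += *a * x;                   /* x * A_{n-1} */
          for (i = 0; i < x; i++) {
                  *a += d[i];             /* sum of the bytes */
                  *b += (x - i) * d[i];   /* descending weights x..1 */
          }
          *a %= 65521;
          *b %= 65521;
  }

(The SVE version defers the modulo to once every few blocks, of course;
this is only meant to pin down the recurrence.)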
>
> >
> > On Tue, 3 Nov 2020 at 13:16, l00374334 <liqiang64 at huawei.com> wrote:
> >>
> >> From: liqiang <liqiang64 at huawei.com>
> >>
> >> In the libz library, the adler32 checksum is usually a significant hot
> >> spot, and the SVE instruction set can accelerate it easily, so the
> >> performance of the libz library improves noticeably.
> >>
> >> We can divide buf into blocks according to the SVE vector width and then
> >> use vector registers to process the data one block at a time.
> >>
> >> On machines that support arm64 SVE instructions, this algorithm is about
> >> 3~4 times faster than the C implementation in libz. The wider the SVE
> >> vectors, the larger the speedup.
> >>
> >> Measured on a Taishan 1951 machine that supports 256-bit SVE, here are
> >> the results for 1M and 10M of random data:
> >>
> >> [root at xxx adler32]# ./benchmark 1000000
> >> Libz alg: Time used: 608 us, 1644.7 Mb/s.
> >> SVE alg: Time used: 166 us, 6024.1 Mb/s.
> >>
> >> [root at xxx adler32]# ./benchmark 10000000
> >> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
> >> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
> >>
> >> The blocks can be of any size, so the algorithm can automatically adapt
> >> to SVE hardware with different bit widths without modifying the code.
> >>
> >
> > Please drop this indentation from the commit log.
> >
> >>
> >> Signed-off-by: liqiang <liqiang64 at huawei.com>
> >> ---
> >> arch/arm64/crypto/Kconfig | 5 ++
> >> arch/arm64/crypto/Makefile | 3 +
> >> arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
> >> arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
> >> crypto/testmgr.c | 8 +-
> >> crypto/testmgr.h | 13 +++
> >
> > Please split into two patches. Also, who is going to use this "adler32" shash?
>
> In the kernel, adler32 is used by the zlib_deflate algorithm as its
> checksum, just as it is in the libz library.
>
I understand that zlib_deflate uses adler32 internally. But where does
it invoke the crypto API to use the shash abstraction to perform this
transformation?
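
To put it differently: for this driver to be useful, some in-kernel user
would have to go through the crypto API, along these lines (hypothetical
caller, just to illustrate the plumbing; no such call site exists today
AFAICT):

  #include <linux/err.h>
  #include <crypto/hash.h>

  static int adler32_via_shash(const u8 *data, unsigned int len, u8 *out)
  {
          struct crypto_shash *tfm;
          int err;

          tfm = crypto_alloc_shash("adler32", 0, 0);
          if (IS_ERR(tfm))
                  return PTR_ERR(tfm);

          {
                  SHASH_DESC_ON_STACK(desc, tfm);

                  desc->tfm = tfm;
                  err = crypto_shash_digest(desc, data, len, out);
          }

          crypto_free_shash(tfm);
          return err;
  }

zlib_deflate, as far as I can see, simply calls zlib_adler32() directly,
so merely registering an shash does not make it any faster.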
> >
> >> 6 files changed, 248 insertions(+), 1 deletion(-)
> >> create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
> >> create mode 100644 arch/arm64/crypto/adler32-sve.S
> >>
> >> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> >> index b8eb045..cfe58b9 100644
> >> --- a/arch/arm64/crypto/Kconfig
> >> +++ b/arch/arm64/crypto/Kconfig
> >> @@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
> >> select CRYPTO_LIB_AES
> >> select CRYPTO_SIMD
> >>
> >> +config SVE_ADLER32
> >> + tristate "Accelerate Adler32 using arm64 SVE instructions."
> >> + depends on ARM64_SVE
> >> + select CRYPTO_HASH
> >> +
> >> endif
> >> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> >> index d0901e6..45fe649 100644
> >> --- a/arch/arm64/crypto/Makefile
> >> +++ b/arch/arm64/crypto/Makefile
> >> @@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
> >> obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
> >> aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
> >>
> >> +obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
> >> +sve-adler32-y := adler32-sve.o adler32-sve-glue.o
> >> +
> >> CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
> >>
> >> $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
> >> diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
> >> new file mode 100644
> >> index 0000000..cb74514
> >> --- /dev/null
> >> +++ b/arch/arm64/crypto/adler32-sve-glue.c
> >> @@ -0,0 +1,93 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Accelerate Adler32 using arm64 SVE instructions.
> >> + * Automatically support all bit width of SVE
> >> + * vector(128~2048).
> >> + *
> >> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> >> + *
> >> + * Author: Li Qiang <liqiang64 at huawei.com>
> >> + */
> >> +#include <linux/cpufeature.h>
> >> +#include <linux/kernel.h>
> >> +#include <linux/module.h>
> >> +#include <linux/zutil.h>
> >> +
> >> +#include <crypto/internal/hash.h>
> >> +#include <crypto/internal/simd.h>
> >> +
> >> +#include <asm/neon.h>
> >> +#include <asm/simd.h>
> >> +
> >> +/* Scalable vector extension min size 128bit */
> >> +#define SVE_ADLER32_MIN_SIZE 16U
> >> +#define SVE_ADLER32_DIGEST_SIZE 4
> >> +#define SVE_ADLER32_BLOCKSIZE 1
> >> +
> >> +asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
> >> +
> >> +static int adler32_sve_init(struct shash_desc *desc)
> >> +{
> >> + u32 *adler32 = shash_desc_ctx(desc);
> >> +
> >> + *adler32 = 1;
> >> + return 0;
> >> +}
> >> +
> >> +static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
> >> + unsigned int length)
> >
> > Please indent function parameters
> >
> >> +{
> >> + u32 *adler32 = shash_desc_ctx(desc);
> >> +
> >> + if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
> >> + kernel_neon_begin();
> >> + *adler32 = adler32_sve(*adler32, data, length);
> >> + kernel_neon_end();
> >> + } else {
> >> + *adler32 = zlib_adler32(*adler32, data, length);
> >> + }
> >> + return 0;
> >> +}
> >> +
> >> +static int adler32_sve_final(struct shash_desc *desc, u8 *out)
> >> +{
> >> + u32 *adler32 = shash_desc_ctx(desc);
> >> +
> >> + *(u32 *)out = *adler32;
> >
> > Please use put_unaligned here
> >
> >> + return 0;
> >> +}
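
... i.e., something like this (untested, keeping the current
native-endian output; needs #include <asm/unaligned.h>):

  static int adler32_sve_final(struct shash_desc *desc, u8 *out)
  {
          u32 *adler32 = shash_desc_ctx(desc);

          /* out is not guaranteed to be 32-bit aligned */
          put_unaligned(*adler32, (u32 *)out);
          return 0;
  }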
> >> +
> >> +static struct shash_alg adler32_sve_alg[] = {{
> >> + .digestsize = SVE_ADLER32_DIGEST_SIZE,
> >> + .descsize = SVE_ADLER32_DIGEST_SIZE,
> >> + .init = adler32_sve_init,
> >> + .update = adler32_sve_update,
> >> + .final = adler32_sve_final,
> >> +
> >> + .base.cra_name = "adler32",
> >> + .base.cra_driver_name = "adler32-arm64-sve",
> >> + .base.cra_priority = 200,
> >> + .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
> >> + .base.cra_module = THIS_MODULE,
> >
> > Please make sure the indentation is correct here.
> >
> >> +}};
> >> +
> >> +static int __init adler32_sve_mod_init(void)
> >> +{
> >> + if (!cpu_have_named_feature(SVE))
> >> + return 0;
> >> +
> >> + return crypto_register_shash(adler32_sve_alg);
> >> +}
> >> +
> >> +static void __exit adler32_sve_mod_exit(void)
> >> +{
> >> + crypto_unregister_shash(adler32_sve_alg);
> >> +}
> >> +
> >> +module_init(adler32_sve_mod_init);
> >> +module_exit(adler32_sve_mod_exit);
> >> +
> >> +MODULE_AUTHOR("Li Qiang <liqiang64 at huawei.com>");
> >> +MODULE_LICENSE("GPL v2");
> >> +MODULE_ALIAS_CRYPTO("adler32");
> >> +MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
> >> diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
> >> new file mode 100644
> >> index 0000000..34ee4bb
> >> --- /dev/null
> >> +++ b/arch/arm64/crypto/adler32-sve.S
> >> @@ -0,0 +1,127 @@
> >> +/* SPDX-License-Identifier: GPL-2.0-only */
> >> +/*
> >> + * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
> >> + * width of SVE vector(128~2048).
> >> + *
> >> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> >> + *
> >> + * Author: Li Qiang <liqiang64 at huawei.com>
> >> + */
> >> +
> >> +#include <linux/linkage.h>
> >> +#include <asm/assembler.h>
> >> +
> >> +.arch armv8-a+sve
> >> +.file "adler32_sve.S"
> >
> > Drop the .file
> >
> > Please indent the rest 1 tab
> >
> >> +.text
> >> +.align 6
> >> +
> >> +//The supported sve vector length range is 128~2048 by this Adler_sequence
> >> +.Adler_sequence:
> >
> > This should be in .rodata. Also, if you use . or L prefixes, use .L
> > because that is what you need to make these local symbols.
> >
> >
> >> + .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
> >> + .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
> >> + .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
> >> + .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
> >> + .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
> >> + .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
> >> + .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
> >> + .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
> >> + .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
> >> + .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
> >> + .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
> >> + .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
> >> + .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
> >> + .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
> >> + .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
> >> + .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
> >> +
> >> +SYM_FUNC_START(adler32_sve)
> >> + and w10, w0, #0xffff
> >> + lsr w11, w0, #16
> >> +
> >
> > Please put the instruction mnemonics in different columns using tabs
> >
> >> + // Get the length of the sve vector to x6.
> >> + mov x6, #0
> >> + mov x9, #256
> >> + addvl x6, x6, #1
> >> + adr x12, .Adler_sequence
> >> + ptrue p1.h
> >> +
> >> + // Get the starting position of the required sequence.
> >> + sub x9, x9, x6
> >> + ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
> >> + inch x9
> >> + ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h
> >> + mov x9, #0
> >> + // A little of byte, jumper to normal proc
> >
> > What does this comment mean?
> >
> >> + mov x14, #3
> >> + mul x15, x14, x6
> >> + cmp x2, x15
> >> + b.le Lnormal_proc
> >> +
> >> + ptrue p0.b
> >> +.align 6
> >
> > Ident.
> >
> >> +LBig_loop:
> >
> > Use .L prefix
> >
> >> + // x is SVE vector length (byte).
> >> + // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
> >> + // An = An-1 + D1 + D2 + D3 + ... + Dx
> >> +
> >> + .macro ADLER_BLOCK_X
> >
> > Please use lower case for asm macros, to distinguish them from CPP macros
> > Also, indent the name, and move the macro out of the function for legibility.
> >
> >> + ld1b z0.b, p0/z, [x1, x9]
> >> + incb x9
> >> + uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
> >> + mov x12, v20.2d[0]
> >> + madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
> >> +
> >> + uunpklo z26.h, z0.b
> >> + uunpkhi z27.h, z0.b
> >> + mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
> >> + mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
> >> +
> >> + uaddv d21, p1, z26.h
> >> + uaddv d22, p1, z27.h
> >> + mov x13, v21.2d[0]
> >> + mov x14, v22.2d[0]
> >> +
> >> + add x11, x13, x11
> >> + add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
> >> + add x10, x12, x10 // An += D1 + D2 + ... + Dx
> >> + .endm
> >> + ADLER_BLOCK_X
> >> + ADLER_BLOCK_X
> >> + ADLER_BLOCK_X
> >> + // calc = reg0 % 65521
> >> + .macro mod65521, reg0, reg1, reg2
> >> + mov w\reg1, #0x8071
> >> + mov w\reg2, #0xfff1
> >> + movk w\reg1, #0x8007, lsl #16
> >> + umull x\reg1, w\reg0, w\reg1
> >> + lsr x\reg1, x\reg1, #47
> >> + msub w\reg0, w\reg1, w\reg2, w\reg0
> >> + .endm
> >> +
> >
> > Same as above
> >
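
For other readers: this appears to be a Barrett-style reduction, with
0x80078071 being roughly 2^47 / 65521, so the umull/lsr/msub sequence is
doing, in effect (scalar C model, exact for the accumulator ranges
reached here, which stay far below 2^31):

  #include <linux/types.h>

  static u32 mod65521(u32 x)
  {
          u32 q = ((u64)x * 0x80078071) >> 47;    /* ~ x / 65521 */

          return x - q * 65521;                   /* the remainder */
  }

A comment along those lines in the asm would help.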
> >> + mod65521 10, 14, 12
> >> + mod65521 11, 14, 12
> >> +
> >> + sub x2, x2, x15
> >> + cmp x2, x15
> >> + b.ge LBig_loop
> >> +
> >> +.align 6
> >
> > Indent
> >
> >> +Lnormal_proc:
> >
> > .L
> >
> >
> >> + cmp x2, #0
> >> + b.eq Lret
> >> +
> >> + ldrb w12, [x1, x9]
> >> + add x9, x9, #1
> >> + add x10, x12, x10
> >> + add x11, x10, x11
> >> + sub x2, x2, #1
> >> + b Lnormal_proc
> >> +
> >> +Lret:
> >> + mod65521 10, 14, 12
> >> + mod65521 11, 14, 12
> >> + lsl x11, x11, #16
> >> + orr x0, x10, x11
> >> + ret
> >> +SYM_FUNC_END(adler32_sve)
> >> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
> >> index a64a639..58b8020 100644
> >> --- a/crypto/testmgr.c
> >> +++ b/crypto/testmgr.c
> >> @@ -4174,6 +4174,13 @@ static const struct alg_test_desc alg_test_descs[] = {
> >> .suite = {
> >> .cipher = __VECS(adiantum_xchacha20_aes_tv_template)
> >> },
> >> + }, {
> >> + .alg = "adler32",
> >> + .test = alg_test_hash,
> >> + .fips_allowed = 1,
> >> + .suite = {
> >> + .hash = __VECS(adler32_tv_template)
> >> + }
> >> }, {
> >> .alg = "aegis128",
> >> .test = alg_test_aead,
> >> @@ -5640,7 +5647,6 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
> >> }
> >>
> >> DO_ONCE(testmgr_onetime_init);
> >> -
> >> if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
> >> char nalg[CRYPTO_MAX_ALG_NAME];
> >>
> >> diff --git a/crypto/testmgr.h b/crypto/testmgr.h
> >> index 8c83811..5233960 100644
> >> --- a/crypto/testmgr.h
> >> +++ b/crypto/testmgr.h
> >> @@ -3676,6 +3676,19 @@ static const struct hash_testvec crct10dif_tv_template[] = {
> >> }
> >> };
> >>
> >> +static const struct hash_testvec adler32_tv_template[] = {
> >> + {
> >> + .plaintext = "abcde",
> >> + .psize = 5,
> >> + .digest = "\xf0\x01\xc8\x05",
> >> + },
> >> + {
> >> + .plaintext = "0123456789101112131415",
> >> + .psize = 22,
> >> + .digest = "\x63\x04\xa8\x32",
> >> + },
> >> +};
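
For reference, the first vector matches zlib: adler32("abcde") is
0x05c801f0, and the digest bytes above are that value in little-endian
byte order, i.e. as the current final() stores it on a little-endian
kernel. A quick way to double-check vectors like these from userspace
(hypothetical helper, link with -lz):

  #include <stdio.h>
  #include <zlib.h>

  int main(void)
  {
          uLong a = adler32(0L, Z_NULL, 0);       /* initial value, 1 */

          a = adler32(a, (const Bytef *)"abcde", 5);
          printf("%08lx\n", a);                   /* expect 05c801f0 */
          return 0;
  }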
> >> +
> >> /*
> >> * Streebog test vectors from RFC 6986 and GOST R 34.11-2012
> >> */
> >> --
> >> 2.19.1
> >>
> > .
> >
>
> Thank you for your suggestions; I will address them one by one in my code.
> :-)
>
> --
> Best regards,
> Li Qiang