[PATCH 2/2] [v2] crypto: sha1: add ARM NEON implementation
Ard Biesheuvel
ard.biesheuvel at linaro.org
Mon Jun 30 01:20:36 PDT 2014
On 29 June 2014 16:33, Jussi Kivilinna <jussi.kivilinna at iki.fi> wrote:
> This patch adds an ARM NEON assembly implementation of the SHA-1 algorithm.
>
> tcrypt benchmark results on Cortex-A8, sha1-arm-asm vs sha1-neon-asm:
>
> block-size bytes/update old-vs-new
> 16 16 1.04x
> 64 16 1.02x
> 64 64 1.05x
> 256 16 1.03x
> 256 64 1.04x
> 256 256 1.30x
> 1024 16 1.03x
> 1024 256 1.36x
> 1024 1024 1.52x
> 2048 16 1.03x
> 2048 256 1.39x
> 2048 1024 1.55x
> 2048 2048 1.59x
> 4096 16 1.03x
> 4096 256 1.40x
> 4096 1024 1.57x
> 4096 4096 1.62x
> 8192 16 1.03x
> 8192 256 1.40x
> 8192 1024 1.58x
> 8192 4096 1.63x
> 8192 8192 1.63x
>
> Changes in v2:
> - Use ENTRY/ENDPROC
> - Don't provide Thumb2 version
> - Move constants to .text section
> - Further tweaks to implementation for ~10% speed-up.
>
Please move the changelog to below the '---' so it doesn't end up in
the kernel commit log.
> Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Acked-by: Ard Biesheuvel <ard.biesheuvel at linaro.org>
Tested-by: Ard Biesheuvel <ard.biesheuvel at linaro.org>
Tested on Exynos-5250 (Cortex-A15)
ARM asm
=======
[ 1478.699012] testing speed of sha1
[ 1478.699040] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 873594 opers/sec, 13977514 bytes/sec
[ 1481.694959] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 386415 opers/sec, 24730581 bytes/sec
[ 1484.694958] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 543196 opers/sec, 34764586 bytes/sec
[ 1487.694959] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 141109 opers/sec, 36123989 bytes/sec
[ 1490.694959] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 218391 opers/sec, 55908266 bytes/sec
[ 1493.694958] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 256225 opers/sec, 65593685 bytes/sec
[ 1496.694959] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 39845 opers/sec, 40801280 bytes/sec
[ 1499.694973] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 78594 opers/sec, 80480597 bytes/sec
[ 1502.694966] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 83790 opers/sec, 85801642 bytes/sec
[ 1505.694966] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 20204 opers/sec, 41379157 bytes/sec
[ 1508.694989] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 41075 opers/sec, 84121600 bytes/sec
[ 1511.694979] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 43358 opers/sec, 88797184 bytes/sec
[ 1514.694960] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 44168 opers/sec, 90457429 bytes/sec
[ 1517.694968] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 10331 opers/sec, 42315776 bytes/sec
[ 1520.694967] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 21004 opers/sec, 86032384 bytes/sec
[ 1523.694955] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 22193 opers/sec, 90903893 bytes/sec
[ 1526.694989] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 22671 opers/sec, 92860416 bytes/sec
[ 1529.695000] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 5192 opers/sec, 42538325 bytes/sec
[ 1532.695110] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 10628 opers/sec, 87067306 bytes/sec
[ 1535.695015] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 11233 opers/sec, 92026197 bytes/sec
[ 1538.694997] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 11393 opers/sec, 93334186 bytes/sec
[ 1541.694980] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 11427 opers/sec, 93615445 bytes/sec
ARM neon
========
[ 1582.519068] testing speed of sha1
[ 1582.519097] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 900970 opers/sec, 14415520 bytes/sec
[ 1585.514959] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 406465 opers/sec, 26013802 bytes/sec
[ 1588.514961] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 579712 opers/sec, 37101610 bytes/sec
[ 1591.514958] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 139189 opers/sec, 35632554 bytes/sec
[ 1594.514964] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 234671 opers/sec, 60075861 bytes/sec
[ 1597.514960] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 347872 opers/sec, 89055402 bytes/sec
[ 1600.514959] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 38385 opers/sec, 39306922 bytes/sec
[ 1603.514968] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 113441 opers/sec, 116163584 bytes/sec
[ 1606.514963] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 134316 opers/sec, 137539925 bytes/sec
[ 1609.514964] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 19514 opers/sec, 39966037 bytes/sec
[ 1612.514957] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 59782 opers/sec, 122434901 bytes/sec
[ 1615.514958] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 71359 opers/sec, 146144597 bytes/sec
[ 1618.514958] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 73938 opers/sec, 151425024 bytes/sec
[ 1621.514968] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 9844 opers/sec, 40322389 bytes/sec
[ 1624.514998] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 30744 opers/sec, 125928789 bytes/sec
[ 1627.514987] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 36904 opers/sec, 151161514 bytes/sec
[ 1630.514973] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 38912 opers/sec, 159383552 bytes/sec
[ 1633.514966] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 4937 opers/sec, 40449365 bytes/sec
[ 1636.515082] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 15598 opers/sec, 127781546 bytes/sec
[ 1639.515021] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 18776 opers/sec, 153818453 bytes/sec
[ 1642.514978] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 19809 opers/sec, 162278058 bytes/sec
[ 1645.514997] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 19819 opers/sec, 162362709 bytes/sec
> ---
> arch/arm/crypto/Makefile | 2
> arch/arm/crypto/sha1-armv7-neon.S | 634 ++++++++++++++++++++++++++++++++++++
> arch/arm/crypto/sha1_glue.c | 8
> arch/arm/crypto/sha1_neon_glue.c | 197 +++++++++++
> arch/arm/include/asm/crypto/sha1.h | 10 +
> crypto/Kconfig | 11 +
> 6 files changed, 859 insertions(+), 3 deletions(-)
> create mode 100644 arch/arm/crypto/sha1-armv7-neon.S
> create mode 100644 arch/arm/crypto/sha1_neon_glue.c
> create mode 100644 arch/arm/include/asm/crypto/sha1.h
>
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index 81cda39..374956d 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -5,10 +5,12 @@
> obj-$(CONFIG_CRYPTO_AES_ARM) += aes-arm.o
> obj-$(CONFIG_CRYPTO_AES_ARM_BS) += aes-arm-bs.o
> obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
> +obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>
> aes-arm-y := aes-armv4.o aes_glue.o
> aes-arm-bs-y := aesbs-core.o aesbs-glue.o
> sha1-arm-y := sha1-armv4-large.o sha1_glue.o
> +sha1-arm-neon-y := sha1-armv7-neon.o sha1_neon_glue.o
>
> quiet_cmd_perl = PERL $@
> cmd_perl = $(PERL) $(<) > $(@)
> diff --git a/arch/arm/crypto/sha1-armv7-neon.S b/arch/arm/crypto/sha1-armv7-neon.S
> new file mode 100644
> index 0000000..50013c0
> --- /dev/null
> +++ b/arch/arm/crypto/sha1-armv7-neon.S
> @@ -0,0 +1,634 @@
> +/* sha1-armv7-neon.S - ARM/NEON accelerated SHA-1 transform function
> + *
> + * Copyright © 2013-2014 Jussi Kivilinna <jussi.kivilinna at iki.fi>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + */
> +
> +#include <linux/linkage.h>
> +
> +
> +.syntax unified
> +.code 32
> +.fpu neon
> +
> +.text
> +
> +
> +/* Context structure */
> +
> +#define state_h0 0
> +#define state_h1 4
> +#define state_h2 8
> +#define state_h3 12
> +#define state_h4 16
> +
> +
> +/* Constants */
> +
> +#define K1 0x5A827999
> +#define K2 0x6ED9EBA1
> +#define K3 0x8F1BBCDC
> +#define K4 0xCA62C1D6
> +.align 4
> +.LK_VEC:
> +.LK1: .long K1, K1, K1, K1
> +.LK2: .long K2, K2, K2, K2
> +.LK3: .long K3, K3, K3, K3
> +.LK4: .long K4, K4, K4, K4
> +
> +
> +/* Register macros */
> +
> +#define RSTATE r0
> +#define RDATA r1
> +#define RNBLKS r2
> +#define ROLDSTACK r3
> +#define RWK lr
> +
> +#define _a r4
> +#define _b r5
> +#define _c r6
> +#define _d r7
> +#define _e r8
> +
> +#define RT0 r9
> +#define RT1 r10
> +#define RT2 r11
> +#define RT3 r12
> +
> +#define W0 q0
> +#define W1 q1
> +#define W2 q2
> +#define W3 q3
> +#define W4 q4
> +#define W5 q5
> +#define W6 q6
> +#define W7 q7
> +
> +#define tmp0 q8
> +#define tmp1 q9
> +#define tmp2 q10
> +#define tmp3 q11
> +
> +#define qK1 q12
> +#define qK2 q13
> +#define qK3 q14
> +#define qK4 q15
> +
> +
> +/* Round function macros. */
> +
> +#define WK_offs(i) (((i) & 15) * 4)
> +
> +#define _R_F1(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + ldr RT3, [sp, WK_offs(i)]; \
> + pre1(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + bic RT0, d, b; \
> + add e, e, a, ror #(32 - 5); \
> + and RT1, c, b; \
> + pre2(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + add RT0, RT0, RT3; \
> + add e, e, RT1; \
> + ror b, #(32 - 30); \
> + pre3(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + add e, e, RT0;
> +
> +#define _R_F2(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + ldr RT3, [sp, WK_offs(i)]; \
> + pre1(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + eor RT0, d, b; \
> + add e, e, a, ror #(32 - 5); \
> + eor RT0, RT0, c; \
> + pre2(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + add e, e, RT3; \
> + ror b, #(32 - 30); \
> + pre3(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + add e, e, RT0; \
> +
> +#define _R_F3(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + ldr RT3, [sp, WK_offs(i)]; \
> + pre1(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + eor RT0, b, c; \
> + and RT1, b, c; \
> + add e, e, a, ror #(32 - 5); \
> + pre2(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + and RT0, RT0, d; \
> + add RT1, RT1, RT3; \
> + add e, e, RT0; \
> + ror b, #(32 - 30); \
> + pre3(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
> + add e, e, RT1;
> +
> +#define _R_F4(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + _R_F2(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28)
> +
> +#define _R(a,b,c,d,e,f,i,pre1,pre2,pre3,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + _R_##f(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28)
> +
> +#define R(a,b,c,d,e,f,i) \
> + _R_##f(a,b,c,d,e,i,dummy,dummy,dummy,i16,\
> + W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28)
> +
> +#define dummy(...)
> +
> +
> +/* Input expansion macros. */
> +
> +/********* Precalc macros for rounds 0-15 *************************************/
> +
> +#define W_PRECALC_00_15() \
> + add RWK, sp, #(WK_offs(0)); \
> + \
> + vld1.32 {tmp0, tmp1}, [RDATA]!; \
> + vrev32.8 W0, tmp0; /* big => little */ \
> + vld1.32 {tmp2, tmp3}, [RDATA]!; \
> + vadd.u32 tmp0, W0, curK; \
> + vrev32.8 W7, tmp1; /* big => little */ \
> + vrev32.8 W6, tmp2; /* big => little */ \
> + vadd.u32 tmp1, W7, curK; \
> + vrev32.8 W5, tmp3; /* big => little */ \
> + vadd.u32 tmp2, W6, curK; \
> + vst1.32 {tmp0, tmp1}, [RWK]!; \
> + vadd.u32 tmp3, W5, curK; \
> + vst1.32 {tmp2, tmp3}, [RWK]; \
> +
> +#define WPRECALC_00_15_0(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vld1.32 {tmp0, tmp1}, [RDATA]!; \
> +
> +#define WPRECALC_00_15_1(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + add RWK, sp, #(WK_offs(0)); \
> +
> +#define WPRECALC_00_15_2(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vrev32.8 W0, tmp0; /* big => little */ \
> +
> +#define WPRECALC_00_15_3(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vld1.32 {tmp2, tmp3}, [RDATA]!; \
> +
> +#define WPRECALC_00_15_4(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vadd.u32 tmp0, W0, curK; \
> +
> +#define WPRECALC_00_15_5(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vrev32.8 W7, tmp1; /* big => little */ \
> +
> +#define WPRECALC_00_15_6(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vrev32.8 W6, tmp2; /* big => little */ \
> +
> +#define WPRECALC_00_15_7(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vadd.u32 tmp1, W7, curK; \
> +
> +#define WPRECALC_00_15_8(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vrev32.8 W5, tmp3; /* big => little */ \
> +
> +#define WPRECALC_00_15_9(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vadd.u32 tmp2, W6, curK; \
> +
> +#define WPRECALC_00_15_10(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vst1.32 {tmp0, tmp1}, [RWK]!; \
> +
> +#define WPRECALC_00_15_11(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vadd.u32 tmp3, W5, curK; \
> +
> +#define WPRECALC_00_15_12(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vst1.32 {tmp2, tmp3}, [RWK]; \
> +
> +
> +/********* Precalc macros for rounds 16-31 ************************************/
> +
> +#define WPRECALC_16_31_0(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor tmp0, tmp0; \
> + vext.8 W, W_m16, W_m12, #8; \
> +
> +#define WPRECALC_16_31_1(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + add RWK, sp, #(WK_offs(i)); \
> + vext.8 tmp0, W_m04, tmp0, #4; \
> +
> +#define WPRECALC_16_31_2(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor tmp0, tmp0, W_m16; \
> + veor.32 W, W, W_m08; \
> +
> +#define WPRECALC_16_31_3(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor tmp1, tmp1; \
> + veor W, W, tmp0; \
> +
> +#define WPRECALC_16_31_4(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vshl.u32 tmp0, W, #1; \
> +
> +#define WPRECALC_16_31_5(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vext.8 tmp1, tmp1, W, #(16-12); \
> + vshr.u32 W, W, #31; \
> +
> +#define WPRECALC_16_31_6(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vorr tmp0, tmp0, W; \
> + vshr.u32 W, tmp1, #30; \
> +
> +#define WPRECALC_16_31_7(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vshl.u32 tmp1, tmp1, #2; \
> +
> +#define WPRECALC_16_31_8(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor tmp0, tmp0, W; \
> +
> +#define WPRECALC_16_31_9(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor W, tmp0, tmp1; \
> +
> +#define WPRECALC_16_31_10(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vadd.u32 tmp0, W, curK; \
> +
> +#define WPRECALC_16_31_11(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vst1.32 {tmp0}, [RWK];
> +
> +
> +/********* Precalc macros for rounds 32-79 ************************************/
> +
> +#define WPRECALC_32_79_0(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor W, W_m28; \
> +
> +#define WPRECALC_32_79_1(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vext.8 tmp0, W_m08, W_m04, #8; \
> +
> +#define WPRECALC_32_79_2(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor W, W_m16; \
> +
> +#define WPRECALC_32_79_3(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + veor W, tmp0; \
> +
> +#define WPRECALC_32_79_4(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + add RWK, sp, #(WK_offs(i&~3)); \
> +
> +#define WPRECALC_32_79_5(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vshl.u32 tmp1, W, #2; \
> +
> +#define WPRECALC_32_79_6(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vshr.u32 tmp0, W, #30; \
> +
> +#define WPRECALC_32_79_7(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vorr W, tmp0, tmp1; \
> +
> +#define WPRECALC_32_79_8(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vadd.u32 tmp0, W, curK; \
> +
> +#define WPRECALC_32_79_9(i,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
> + vst1.32 {tmp0}, [RWK];
> +
> +
> +/*
> + * Transform nblks*64 bytes (nblks*16 32-bit words) at DATA.
> + *
> + * unsigned int
> + * sha1_transform_neon (void *ctx, const unsigned char *data,
> + * unsigned int nblks)
> + */
> +.align 3
> +ENTRY(sha1_transform_neon)
> + /* input:
> + * r0: ctx, CTX
> + * r1: data (64*nblks bytes)
> + * r2: nblks
> + */
> +
> + cmp RNBLKS, #0;
> + beq .Ldo_nothing;
> +
> + push {r4-r12, lr};
> + /*vpush {q4-q7};*/
> +
> + adr RT3, .LK_VEC;
> +
> + mov ROLDSTACK, sp;
> +
> + /* Align stack. */
> + sub RT0, sp, #(16*4);
> + and RT0, #(~(16-1));
> + mov sp, RT0;
> +
> + vld1.32 {qK1-qK2}, [RT3]!; /* Load K1,K2 */
> +
> + /* Get the values of the chaining variables. */
> + ldm RSTATE, {_a-_e};
> +
> + vld1.32 {qK3-qK4}, [RT3]; /* Load K3,K4 */
> +
> +#undef curK
> +#define curK qK1
> + /* Precalc 0-15. */
> + W_PRECALC_00_15();
> +
> +.Loop:
> + /* Transform 0-15 + Precalc 16-31. */
> + _R( _a, _b, _c, _d, _e, F1, 0,
> + WPRECALC_16_31_0, WPRECALC_16_31_1, WPRECALC_16_31_2, 16,
> + W4, W5, W6, W7, W0, _, _, _ );
> + _R( _e, _a, _b, _c, _d, F1, 1,
> + WPRECALC_16_31_3, WPRECALC_16_31_4, WPRECALC_16_31_5, 16,
> + W4, W5, W6, W7, W0, _, _, _ );
> + _R( _d, _e, _a, _b, _c, F1, 2,
> + WPRECALC_16_31_6, WPRECALC_16_31_7, WPRECALC_16_31_8, 16,
> + W4, W5, W6, W7, W0, _, _, _ );
> + _R( _c, _d, _e, _a, _b, F1, 3,
> + WPRECALC_16_31_9, WPRECALC_16_31_10,WPRECALC_16_31_11,16,
> + W4, W5, W6, W7, W0, _, _, _ );
> +
> +#undef curK
> +#define curK qK2
> + _R( _b, _c, _d, _e, _a, F1, 4,
> + WPRECALC_16_31_0, WPRECALC_16_31_1, WPRECALC_16_31_2, 20,
> + W3, W4, W5, W6, W7, _, _, _ );
> + _R( _a, _b, _c, _d, _e, F1, 5,
> + WPRECALC_16_31_3, WPRECALC_16_31_4, WPRECALC_16_31_5, 20,
> + W3, W4, W5, W6, W7, _, _, _ );
> + _R( _e, _a, _b, _c, _d, F1, 6,
> + WPRECALC_16_31_6, WPRECALC_16_31_7, WPRECALC_16_31_8, 20,
> + W3, W4, W5, W6, W7, _, _, _ );
> + _R( _d, _e, _a, _b, _c, F1, 7,
> + WPRECALC_16_31_9, WPRECALC_16_31_10,WPRECALC_16_31_11,20,
> + W3, W4, W5, W6, W7, _, _, _ );
> +
> + _R( _c, _d, _e, _a, _b, F1, 8,
> + WPRECALC_16_31_0, WPRECALC_16_31_1, WPRECALC_16_31_2, 24,
> + W2, W3, W4, W5, W6, _, _, _ );
> + _R( _b, _c, _d, _e, _a, F1, 9,
> + WPRECALC_16_31_3, WPRECALC_16_31_4, WPRECALC_16_31_5, 24,
> + W2, W3, W4, W5, W6, _, _, _ );
> + _R( _a, _b, _c, _d, _e, F1, 10,
> + WPRECALC_16_31_6, WPRECALC_16_31_7, WPRECALC_16_31_8, 24,
> + W2, W3, W4, W5, W6, _, _, _ );
> + _R( _e, _a, _b, _c, _d, F1, 11,
> + WPRECALC_16_31_9, WPRECALC_16_31_10,WPRECALC_16_31_11,24,
> + W2, W3, W4, W5, W6, _, _, _ );
> +
> + _R( _d, _e, _a, _b, _c, F1, 12,
> + WPRECALC_16_31_0, WPRECALC_16_31_1, WPRECALC_16_31_2, 28,
> + W1, W2, W3, W4, W5, _, _, _ );
> + _R( _c, _d, _e, _a, _b, F1, 13,
> + WPRECALC_16_31_3, WPRECALC_16_31_4, WPRECALC_16_31_5, 28,
> + W1, W2, W3, W4, W5, _, _, _ );
> + _R( _b, _c, _d, _e, _a, F1, 14,
> + WPRECALC_16_31_6, WPRECALC_16_31_7, WPRECALC_16_31_8, 28,
> + W1, W2, W3, W4, W5, _, _, _ );
> + _R( _a, _b, _c, _d, _e, F1, 15,
> + WPRECALC_16_31_9, WPRECALC_16_31_10,WPRECALC_16_31_11,28,
> + W1, W2, W3, W4, W5, _, _, _ );
> +
> + /* Transform 16-63 + Precalc 32-79. */
> + _R( _e, _a, _b, _c, _d, F1, 16,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 32,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> + _R( _d, _e, _a, _b, _c, F1, 17,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 32,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> + _R( _c, _d, _e, _a, _b, F1, 18,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 32,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> + _R( _b, _c, _d, _e, _a, F1, 19,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 32,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> +
> + _R( _a, _b, _c, _d, _e, F2, 20,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 36,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> + _R( _e, _a, _b, _c, _d, F2, 21,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 36,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> + _R( _d, _e, _a, _b, _c, F2, 22,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 36,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> + _R( _c, _d, _e, _a, _b, F2, 23,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 36,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> +
> +#undef curK
> +#define curK qK3
> + _R( _b, _c, _d, _e, _a, F2, 24,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 40,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> + _R( _a, _b, _c, _d, _e, F2, 25,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 40,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> + _R( _e, _a, _b, _c, _d, F2, 26,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 40,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> + _R( _d, _e, _a, _b, _c, F2, 27,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 40,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> +
> + _R( _c, _d, _e, _a, _b, F2, 28,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 44,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> + _R( _b, _c, _d, _e, _a, F2, 29,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 44,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> + _R( _a, _b, _c, _d, _e, F2, 30,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 44,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> + _R( _e, _a, _b, _c, _d, F2, 31,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 44,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> +
> + _R( _d, _e, _a, _b, _c, F2, 32,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 48,
> + W4, W5, W6, W7, W0, W1, W2, W3);
> + _R( _c, _d, _e, _a, _b, F2, 33,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 48,
> + W4, W5, W6, W7, W0, W1, W2, W3);
> + _R( _b, _c, _d, _e, _a, F2, 34,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 48,
> + W4, W5, W6, W7, W0, W1, W2, W3);
> + _R( _a, _b, _c, _d, _e, F2, 35,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 48,
> + W4, W5, W6, W7, W0, W1, W2, W3);
> +
> + _R( _e, _a, _b, _c, _d, F2, 36,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 52,
> + W3, W4, W5, W6, W7, W0, W1, W2);
> + _R( _d, _e, _a, _b, _c, F2, 37,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 52,
> + W3, W4, W5, W6, W7, W0, W1, W2);
> + _R( _c, _d, _e, _a, _b, F2, 38,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 52,
> + W3, W4, W5, W6, W7, W0, W1, W2);
> + _R( _b, _c, _d, _e, _a, F2, 39,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 52,
> + W3, W4, W5, W6, W7, W0, W1, W2);
> +
> + _R( _a, _b, _c, _d, _e, F3, 40,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 56,
> + W2, W3, W4, W5, W6, W7, W0, W1);
> + _R( _e, _a, _b, _c, _d, F3, 41,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 56,
> + W2, W3, W4, W5, W6, W7, W0, W1);
> + _R( _d, _e, _a, _b, _c, F3, 42,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 56,
> + W2, W3, W4, W5, W6, W7, W0, W1);
> + _R( _c, _d, _e, _a, _b, F3, 43,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 56,
> + W2, W3, W4, W5, W6, W7, W0, W1);
> +
> +#undef curK
> +#define curK qK4
> + _R( _b, _c, _d, _e, _a, F3, 44,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 60,
> + W1, W2, W3, W4, W5, W6, W7, W0);
> + _R( _a, _b, _c, _d, _e, F3, 45,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 60,
> + W1, W2, W3, W4, W5, W6, W7, W0);
> + _R( _e, _a, _b, _c, _d, F3, 46,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 60,
> + W1, W2, W3, W4, W5, W6, W7, W0);
> + _R( _d, _e, _a, _b, _c, F3, 47,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 60,
> + W1, W2, W3, W4, W5, W6, W7, W0);
> +
> + _R( _c, _d, _e, _a, _b, F3, 48,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 64,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> + _R( _b, _c, _d, _e, _a, F3, 49,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 64,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> + _R( _a, _b, _c, _d, _e, F3, 50,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 64,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> + _R( _e, _a, _b, _c, _d, F3, 51,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 64,
> + W0, W1, W2, W3, W4, W5, W6, W7);
> +
> + _R( _d, _e, _a, _b, _c, F3, 52,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 68,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> + _R( _c, _d, _e, _a, _b, F3, 53,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 68,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> + _R( _b, _c, _d, _e, _a, F3, 54,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 68,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> + _R( _a, _b, _c, _d, _e, F3, 55,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 68,
> + W7, W0, W1, W2, W3, W4, W5, W6);
> +
> + _R( _e, _a, _b, _c, _d, F3, 56,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 72,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> + _R( _d, _e, _a, _b, _c, F3, 57,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 72,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> + _R( _c, _d, _e, _a, _b, F3, 58,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 72,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> + _R( _b, _c, _d, _e, _a, F3, 59,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 72,
> + W6, W7, W0, W1, W2, W3, W4, W5);
> +
> + subs RNBLKS, #1;
> +
> + _R( _a, _b, _c, _d, _e, F4, 60,
> + WPRECALC_32_79_0, WPRECALC_32_79_1, WPRECALC_32_79_2, 76,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> + _R( _e, _a, _b, _c, _d, F4, 61,
> + WPRECALC_32_79_3, WPRECALC_32_79_4, WPRECALC_32_79_5, 76,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> + _R( _d, _e, _a, _b, _c, F4, 62,
> + WPRECALC_32_79_6, dummy, WPRECALC_32_79_7, 76,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> + _R( _c, _d, _e, _a, _b, F4, 63,
> + WPRECALC_32_79_8, dummy, WPRECALC_32_79_9, 76,
> + W5, W6, W7, W0, W1, W2, W3, W4);
> +
> + beq .Lend;
> +
> + /* Transform 64-79 + Precalc 0-15 of next block. */
> +#undef curK
> +#define curK qK1
> + _R( _b, _c, _d, _e, _a, F4, 64,
> + WPRECALC_00_15_0, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _a, _b, _c, _d, _e, F4, 65,
> + WPRECALC_00_15_1, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _e, _a, _b, _c, _d, F4, 66,
> + WPRECALC_00_15_2, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _d, _e, _a, _b, _c, F4, 67,
> + WPRECALC_00_15_3, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> +
> + _R( _c, _d, _e, _a, _b, F4, 68,
> + dummy, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _b, _c, _d, _e, _a, F4, 69,
> + dummy, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _a, _b, _c, _d, _e, F4, 70,
> + WPRECALC_00_15_4, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _e, _a, _b, _c, _d, F4, 71,
> + WPRECALC_00_15_5, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> +
> + _R( _d, _e, _a, _b, _c, F4, 72,
> + dummy, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _c, _d, _e, _a, _b, F4, 73,
> + dummy, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _b, _c, _d, _e, _a, F4, 74,
> + WPRECALC_00_15_6, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _a, _b, _c, _d, _e, F4, 75,
> + WPRECALC_00_15_7, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> +
> + _R( _e, _a, _b, _c, _d, F4, 76,
> + WPRECALC_00_15_8, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _d, _e, _a, _b, _c, F4, 77,
> + WPRECALC_00_15_9, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _c, _d, _e, _a, _b, F4, 78,
> + WPRECALC_00_15_10, dummy, dummy, _, _, _, _, _, _, _, _, _ );
> + _R( _b, _c, _d, _e, _a, F4, 79,
> + WPRECALC_00_15_11, dummy, WPRECALC_00_15_12, _, _, _, _, _, _, _, _, _ );
> +
> + /* Update the chaining variables. */
> + ldm RSTATE, {RT0-RT3};
> + add _a, RT0;
> + ldr RT0, [RSTATE, #state_h4];
> + add _b, RT1;
> + add _c, RT2;
> + add _d, RT3;
> + add _e, RT0;
> + stm RSTATE, {_a-_e};
> +
> + b .Loop;
> +
> +.Lend:
> + /* Transform 64-79 */
> + R( _b, _c, _d, _e, _a, F4, 64 );
> + R( _a, _b, _c, _d, _e, F4, 65 );
> + R( _e, _a, _b, _c, _d, F4, 66 );
> + R( _d, _e, _a, _b, _c, F4, 67 );
> + R( _c, _d, _e, _a, _b, F4, 68 );
> + R( _b, _c, _d, _e, _a, F4, 69 );
> + R( _a, _b, _c, _d, _e, F4, 70 );
> + R( _e, _a, _b, _c, _d, F4, 71 );
> + R( _d, _e, _a, _b, _c, F4, 72 );
> + R( _c, _d, _e, _a, _b, F4, 73 );
> + R( _b, _c, _d, _e, _a, F4, 74 );
> + R( _a, _b, _c, _d, _e, F4, 75 );
> + R( _e, _a, _b, _c, _d, F4, 76 );
> + R( _d, _e, _a, _b, _c, F4, 77 );
> + R( _c, _d, _e, _a, _b, F4, 78 );
> + R( _b, _c, _d, _e, _a, F4, 79 );
> +
> + mov sp, ROLDSTACK;
> +
> + /* Update the chaining variables. */
> + ldm RSTATE, {RT0-RT3};
> + add _a, RT0;
> + ldr RT0, [RSTATE, #state_h4];
> + add _b, RT1;
> + add _c, RT2;
> + add _d, RT3;
> + /*vpop {q4-q7};*/
> + add _e, RT0;
> + stm RSTATE, {_a-_e};
> +
> + pop {r4-r12, pc};
> +
> +.Ldo_nothing:
> + bx lr
> +ENDPROC(sha1_transform_neon)
> diff --git a/arch/arm/crypto/sha1_glue.c b/arch/arm/crypto/sha1_glue.c
> index c494e57..84f2a75 100644
> --- a/arch/arm/crypto/sha1_glue.c
> +++ b/arch/arm/crypto/sha1_glue.c
> @@ -23,6 +23,7 @@
> #include <linux/types.h>
> #include <crypto/sha.h>
> #include <asm/byteorder.h>
> +#include <asm/crypto/sha1.h>
>
>
> asmlinkage void sha1_block_data_order(u32 *digest,
> @@ -65,8 +66,8 @@ static int __sha1_update(struct sha1_state *sctx, const u8 *data,
> }
>
>
> -static int sha1_update(struct shash_desc *desc, const u8 *data,
> - unsigned int len)
> +int sha1_update_arm(struct shash_desc *desc, const u8 *data,
> + unsigned int len)
> {
> struct sha1_state *sctx = shash_desc_ctx(desc);
> unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
> @@ -81,6 +82,7 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,
> res = __sha1_update(sctx, data, len, partial);
> return res;
> }
> +EXPORT_SYMBOL_GPL(sha1_update_arm);
>
>
> /* Add padding and return the message digest. */
> @@ -135,7 +137,7 @@ static int sha1_import(struct shash_desc *desc, const void *in)
> static struct shash_alg alg = {
> .digestsize = SHA1_DIGEST_SIZE,
> .init = sha1_init,
> - .update = sha1_update,
> + .update = sha1_update_arm,
> .final = sha1_final,
> .export = sha1_export,
> .import = sha1_import,
> diff --git a/arch/arm/crypto/sha1_neon_glue.c b/arch/arm/crypto/sha1_neon_glue.c
> new file mode 100644
> index 0000000..6f1b411
> --- /dev/null
> +++ b/arch/arm/crypto/sha1_neon_glue.c
> @@ -0,0 +1,197 @@
> +/*
> + * Glue code for the SHA1 Secure Hash Algorithm assembler implementation using
> + * ARM NEON instructions.
> + *
> + * Copyright © 2014 Jussi Kivilinna <jussi.kivilinna at iki.fi>
> + *
> + * This file is based on sha1_generic.c and sha1_ssse3_glue.c:
> + * Copyright (c) Alan Smithee.
> + * Copyright (c) Andrew McDonald <andrew at mcdonald.org.uk>
> + * Copyright (c) Jean-Francois Dive <jef at linuxbe.org>
> + * Copyright (c) Mathias Krause <minipli at googlemail.com>
> + * Copyright (c) Chandramouli Narayanan <mouli at linux.intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + */
> +
> +#include <crypto/internal/hash.h>
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/cryptohash.h>
> +#include <linux/types.h>
> +#include <crypto/sha.h>
> +#include <asm/byteorder.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <asm/crypto/sha1.h>
> +
> +
> +asmlinkage void sha1_transform_neon(void *state_h, const char *data,
> + unsigned int rounds);
> +
> +
> +static int sha1_neon_init(struct shash_desc *desc)
> +{
> + struct sha1_state *sctx = shash_desc_ctx(desc);
> +
> + *sctx = (struct sha1_state){
> + .state = { SHA1_H0, SHA1_H1, SHA1_H2, SHA1_H3, SHA1_H4 },
> + };
> +
> + return 0;
> +}
> +
> +static int __sha1_neon_update(struct shash_desc *desc, const u8 *data,
> + unsigned int len, unsigned int partial)
> +{
> + struct sha1_state *sctx = shash_desc_ctx(desc);
> + unsigned int done = 0;
> +
> + sctx->count += len;
> +
> + if (partial) {
> + done = SHA1_BLOCK_SIZE - partial;
> + memcpy(sctx->buffer + partial, data, done);
> + sha1_transform_neon(sctx->state, sctx->buffer, 1);
> + }
> +
> + if (len - done >= SHA1_BLOCK_SIZE) {
> + const unsigned int rounds = (len - done) / SHA1_BLOCK_SIZE;
> +
> + sha1_transform_neon(sctx->state, data + done, rounds);
> + done += rounds * SHA1_BLOCK_SIZE;
> + }
> +
> + memcpy(sctx->buffer, data + done, len - done);
> +
> + return 0;
> +}
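
For anyone following the arithmetic in __sha1_neon_update(), here is a minimal userspace sketch of how one update call is divided up. It is not part of the patch: `struct split`, `split_update()` and `BLOCK` are hypothetical names used only for illustration. It assumes the same precondition that the fast path in sha1_neon_update() guarantees, namely partial < 64 and partial + len >= 64.

```c
#include <stddef.h>

#define BLOCK 64u /* SHA-1 block size in bytes (SHA1_BLOCK_SIZE in the patch) */

/* Hypothetical helper type: how one update call is divided up. */
struct split {
	unsigned int head;   /* bytes copied to complete the buffered block */
	unsigned int blocks; /* whole blocks hashed directly from the input */
	unsigned int tail;   /* leftover bytes saved in the buffer */
};

/*
 * Assumes partial < BLOCK and partial + len >= BLOCK, so that
 * len - head cannot underflow.
 */
static struct split split_update(unsigned int partial, unsigned int len)
{
	struct split s;

	s.head = partial ? BLOCK - partial : 0;
	s.blocks = (len - s.head) / BLOCK;
	s.tail = len - s.head - s.blocks * BLOCK;
	return s;
}
```

With 10 bytes already buffered and a 200-byte update, this gives a 54-byte head, two whole blocks and an 18-byte tail, which matches the three steps in the function above: one transform on the buffer, one multi-block transform on the input, and a final memcpy() of the remainder.
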
> +
> +static int sha1_neon_update(struct shash_desc *desc, const u8 *data,
> + unsigned int len)
> +{
> + struct sha1_state *sctx = shash_desc_ctx(desc);
> + unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
> + int res;
> +
> + /* Handle the fast case right here */
> + if (partial + len < SHA1_BLOCK_SIZE) {
> + sctx->count += len;
> + memcpy(sctx->buffer + partial, data, len);
> +
> + return 0;
> + }
> +
> + if (!may_use_simd()) {
> + res = sha1_update_arm(desc, data, len);
> + } else {
> + kernel_neon_begin();
> + res = __sha1_neon_update(desc, data, len, partial);
> + kernel_neon_end();
> + }
> +
> + return res;
> +}
> +
> +
> +/* Add padding and return the message digest. */
> +static int sha1_neon_final(struct shash_desc *desc, u8 *out)
> +{
> + struct sha1_state *sctx = shash_desc_ctx(desc);
> + unsigned int i, index, padlen;
> + __be32 *dst = (__be32 *)out;
> + __be64 bits;
> + static const u8 padding[SHA1_BLOCK_SIZE] = { 0x80, };
> +
> + bits = cpu_to_be64(sctx->count << 3);
> +
> + /* Pad out to 56 mod 64 and append length */
> + index = sctx->count % SHA1_BLOCK_SIZE;
> + padlen = (index < 56) ? (56 - index) : ((SHA1_BLOCK_SIZE+56) - index);
> + if (!may_use_simd()) {
> + sha1_update_arm(desc, padding, padlen);
> + sha1_update_arm(desc, (const u8 *)&bits, sizeof(bits));
> + } else {
> + kernel_neon_begin();
> + /* We need to fill a whole block for __sha1_neon_update() */
> + if (padlen <= 56) {
> + sctx->count += padlen;
> + memcpy(sctx->buffer + index, padding, padlen);
> + } else {
> + __sha1_neon_update(desc, padding, padlen, index);
> + }
> + __sha1_neon_update(desc, (const u8 *)&bits, sizeof(bits), 56);
> + kernel_neon_end();
> + }
> +
> + /* Store state in digest */
> + for (i = 0; i < 5; i++)
> + dst[i] = cpu_to_be32(sctx->state[i]);
> +
> + /* Wipe context */
> + memset(sctx, 0, sizeof(*sctx));
> +
> + return 0;
> +}
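
The "pad out to 56 mod 64" step in sha1_neon_final() can be checked in isolation. The sketch below is a standalone userspace illustration, not code from the patch; `sha1_padlen()` and `SHA1_BLOCK` are hypothetical names. It reproduces the padlen expression so that count + padlen + 8 (the 64-bit length field) always lands on a block boundary.

```c
#define SHA1_BLOCK 64u /* stands in for SHA1_BLOCK_SIZE */

/*
 * Number of padding bytes (the 0x80 marker plus zeros) needed so that
 * the message length becomes 56 mod 64, leaving room for the 8-byte
 * big-endian bit count at the end of the block.
 */
static unsigned int sha1_padlen(unsigned long long count)
{
	unsigned int index = (unsigned int)(count % SHA1_BLOCK);

	return (index < 56) ? (56 - index) : ((SHA1_BLOCK + 56) - index);
}
```

The result is always between 1 and 64. An index of exactly 56 yields 64 bytes of padding, so the padding always crosses into the next block whenever index >= 56. That is the case the `else` branch handles by calling __sha1_neon_update() for the padding itself.
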
> +
> +static int sha1_neon_export(struct shash_desc *desc, void *out)
> +{
> + struct sha1_state *sctx = shash_desc_ctx(desc);
> +
> + memcpy(out, sctx, sizeof(*sctx));
> +
> + return 0;
> +}
> +
> +static int sha1_neon_import(struct shash_desc *desc, const void *in)
> +{
> + struct sha1_state *sctx = shash_desc_ctx(desc);
> +
> + memcpy(sctx, in, sizeof(*sctx));
> +
> + return 0;
> +}
> +
> +static struct shash_alg alg = {
> + .digestsize = SHA1_DIGEST_SIZE,
> + .init = sha1_neon_init,
> + .update = sha1_neon_update,
> + .final = sha1_neon_final,
> + .export = sha1_neon_export,
> + .import = sha1_neon_import,
> + .descsize = sizeof(struct sha1_state),
> + .statesize = sizeof(struct sha1_state),
> + .base = {
> + .cra_name = "sha1",
> + .cra_driver_name = "sha1-neon",
> + .cra_priority = 250,
> + .cra_flags = CRYPTO_ALG_TYPE_SHASH,
> + .cra_blocksize = SHA1_BLOCK_SIZE,
> + .cra_module = THIS_MODULE,
> + }
> +};
> +
> +static int __init sha1_neon_mod_init(void)
> +{
> + if (!cpu_has_neon())
> + return -ENODEV;
> +
> + return crypto_register_shash(&alg);
> +}
> +
> +static void __exit sha1_neon_mod_fini(void)
> +{
> + crypto_unregister_shash(&alg);
> +}
> +
> +module_init(sha1_neon_mod_init);
> +module_exit(sha1_neon_mod_fini);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("SHA1 Secure Hash Algorithm, NEON accelerated");
> +MODULE_ALIAS("sha1");
> diff --git a/arch/arm/include/asm/crypto/sha1.h b/arch/arm/include/asm/crypto/sha1.h
> new file mode 100644
> index 0000000..75e6a41
> --- /dev/null
> +++ b/arch/arm/include/asm/crypto/sha1.h
> @@ -0,0 +1,10 @@
> +#ifndef ASM_ARM_CRYPTO_SHA1_H
> +#define ASM_ARM_CRYPTO_SHA1_H
> +
> +#include <linux/crypto.h>
> +#include <crypto/sha.h>
> +
> +extern int sha1_update_arm(struct shash_desc *desc, const u8 *data,
> + unsigned int len);
> +
> +#endif
> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index 025c510..66d7ce1 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -540,6 +540,17 @@ config CRYPTO_SHA1_ARM
> SHA-1 secure hash standard (FIPS 180-1/DFIPS 180-2) implemented
> using optimized ARM assembler.
>
> +config CRYPTO_SHA1_ARM_NEON
> + tristate "SHA1 digest algorithm (ARM NEON)"
> + depends on ARM && KERNEL_MODE_NEON && !CPU_BIG_ENDIAN
> + select CRYPTO_SHA1_ARM
> + select CRYPTO_SHA1
> + select CRYPTO_HASH
> + help
> + SHA-1 secure hash standard (FIPS 180-1/DFIPS 180-2) implemented
> + using optimized ARM NEON assembly, when NEON instructions are
> + available.
> +
> config CRYPTO_SHA1_PPC
> tristate "SHA1 digest algorithm (powerpc)"
> depends on PPC
>