[PATCH 0/5] crypto: add NEON-optimized BLAKE2b
Eric Biggers
ebiggers at kernel.org
Wed Dec 16 15:47:56 EST 2020
On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote:
> This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
> Patches 1-4 prepare for it by making some updates to the generic
> implementation, while patch 5 adds the actual NEON implementation.
>
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1. It is also almost three times
> as fast as the generic implementation of BLAKE2b:
>
> Algorithm            Cycles per byte (on 4096-byte messages)
> ===================  =======================================
> blake2b-256-neon     14.1
> sha1-neon            16.4
> sha1-asm             20.8
> blake2s-256-generic  26.1
> sha256-neon          28.9
> sha256-asm           32.1
> blake2b-256-generic  39.9
>
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S. At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).
>
> NEON-optimized BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1. On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
>
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s. This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it. The best I've been able to do with BLAKE2s
> on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.
By the way, if people are interested in having my ARM scalar implementation of
BLAKE2s in the kernel as well, I can send a patchset for that. It just ended up
being slower than the NEON implementations of BLAKE2b and SHA-1, so it wasn't
as good for the use case mentioned above. If it were to be added as
"blake2s-256-arm", we'd have:
Algorithm            Cycles per byte (on 4096-byte messages)
===================  =======================================
blake2b-256-neon     14.1
sha1-neon            16.4
blake2s-256-arm      19.0
sha1-asm             20.8
blake2s-256-generic  26.1
sha256-neon          28.9
sha256-asm           32.1
blake2b-256-generic  39.9
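
For anyone who wants to turn the cycles-per-byte figures above into relative
speedups or rough throughput numbers, here is a minimal sketch. The cpb values
are copied from the table; the helper names and the 1 GHz clock rate are
assumptions made purely for illustration, not measurements from any device.

```python
# Cycles-per-byte figures copied from the table above
# (Cortex-A7, 4096-byte messages).
CPB = {
    "blake2b-256-neon": 14.1,
    "sha1-neon": 16.4,
    "blake2s-256-arm": 19.0,
    "sha1-asm": 20.8,
    "blake2s-256-generic": 26.1,
    "sha256-neon": 28.9,
    "sha256-asm": 32.1,
    "blake2b-256-generic": 39.9,
}


def speedup(fast, slow):
    """Return how many times faster `fast` is than `slow` (ratio of cpb)."""
    return CPB[slow] / CPB[fast]


def throughput_mb_per_s(name, clock_hz):
    """Approximate throughput in MB/s (10**6 bytes/s) at the given clock rate."""
    return clock_hz / CPB[name] / 1e6


if __name__ == "__main__":
    # Hypothetical clock rate chosen for illustration; not from the email.
    ASSUMED_CLOCK_HZ = 1_000_000_000

    # Print the table sorted from fastest to slowest, with estimated MB/s.
    for name in sorted(CPB, key=CPB.get):
        mbps = throughput_mb_per_s(name, ASSUMED_CLOCK_HZ)
        print(f"{name:<20} {CPB[name]:5.1f} cpb  {mbps:6.1f} MB/s")

    # Ratios behind the comparisons made in the cover letter.
    for slow in ("sha256-neon", "sha1-neon", "blake2b-256-generic"):
        ratio = speedup("blake2b-256-neon", slow)
        print(f"blake2b-256-neon vs {slow}: {ratio:.2f}x")
```

With these numbers, blake2b-256-neon comes out about 2.05x faster than
sha256-neon and about 2.83x faster than blake2b-256-generic, which matches the
"over twice" and "almost three times" statements in the cover letter.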