[PATCH riscv/for-next] crypto: riscv - add vector crypto accelerated AES-CBC-CTS

Wed Feb 14 14:42:27 PST 2024

On Wed, Feb 14, 2024 at 05:34:03PM +0100, Ard Biesheuvel wrote:
> On Tue, 13 Feb 2024 at 06:57, Eric Biggers <ebiggers at kernel.org> wrote:
> >
> > From: Eric Biggers <ebiggers at google.com>
> >
> > Add an implementation of cts(cbc(aes)) accelerated using the Zvkned
> > RISC-V vector crypto extension.  This is mainly useful for fscrypt,
> > where cts(cbc(aes)) is the "default" filenames encryption algorithm.  In
> > that use case, typically most messages are short and are block-aligned.
> 
> Does this mean the storage space for filenames is rounded up to AES block size?

Yes, in most cases.  fscrypt allows the filename padding to be configured as
4, 8, 16, or 32 bytes.  If it's 16 or 32, which is recommended, then the sizes
of encrypted filenames are multiples of the AES block size, except for filenames
longer than 240 bytes, which get rounded up to 255 bytes.
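
As a rough illustration of that rounding, here is a minimal sketch.  The
function and macro names are hypothetical and are not the kernel's; the 16-byte
minimum and the 255-byte cap (NAME_MAX) are assumptions of this sketch.

```c
#include <stddef.h>

#define FNAME_MIN_MSG_LEN 16   /* assumed minimum ciphertext length */
#define FNAME_MAX_LEN     255  /* assumed cap, i.e. NAME_MAX */

/*
 * Hypothetical helper: size of the encrypted filename for a plaintext name
 * of orig_len bytes, with padding being one of 4, 8, 16, or 32.
 */
static size_t padded_fname_len(size_t orig_len, size_t padding)
{
    size_t len = orig_len < FNAME_MIN_MSG_LEN ? FNAME_MIN_MSG_LEN : orig_len;

    /* Round up to the next multiple of the padding. */
    len = (len + padding - 1) / padding * padding;

    /* Names that would exceed the cap are clamped to it instead. */
    return len > FNAME_MAX_LEN ? FNAME_MAX_LEN : len;
}
```

With a padding of 16, a 17-byte name becomes 32 bytes (block-aligned), while a
241-byte name would round up to 256 and is clamped to 255, which is the one
case that is not a multiple of the AES block size.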

> 
> > The CBC-CTS variant implemented is CS3; this is the variant Linux uses.
> >
> > To perform well on short messages, the new implementation processes the
> > full message in one call to the assembly function if the data is
> > contiguous.  Otherwise it falls back to CBC operations followed by CTS
> > at the end.  For decryption, to further improve performance on short
> > messages, especially block-aligned messages, the CBC-CTS assembly
> > function parallelizes the AES decryption of all full blocks.
> 
> Nice!
> 
> > This
> > improves on the arm64 implementation of cts(cbc(aes)), which always
> > splits the CBC part(s) from the CTS part, doing the AES decryptions for
> > the last two blocks serially and usually loading the round keys twice.
> >
> 
> So is the overhead of this sub-optimal approach mostly in the
> redundant loading of the round keys? Or are there other significant
> benefits?
> 
> If there are, I suppose we might port this improvement to x86 too, but
> otherwise, I guess it'll only make sense for arm64.

I expect that the serialization of the last two AES decryptions makes the
biggest difference, followed by the other sources of overhead (loading round
keys, skcipher_walk, kernel_neon_begin).  It needs to be measured, though.
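
To illustrate why the block-aligned case can avoid that serialization: in CBC
decryption every block decryption depends only on ciphertext, and a
block-aligned CS3 message is just the CBC ciphertext with its last two blocks
stored swapped.  The sketch below is a toy under those assumptions, not the
patch's assembly.  toy_encrypt/toy_decrypt are insecure stand-ins for AES, and
all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy invertible "block cipher" standing in for AES.  NOT secure. */
static void toy_encrypt(const uint8_t *key, const uint8_t *in, uint8_t *out)
{
    for (int i = 0; i < BLK; i++)
        out[i] = (uint8_t)((in[i] ^ key[i]) + 0x5a);
}

static void toy_decrypt(const uint8_t *key, const uint8_t *in, uint8_t *out)
{
    for (int i = 0; i < BLK; i++)
        out[i] = (uint8_t)((uint8_t)(in[i] - 0x5a) ^ key[i]);
}

/* Plain CBC encryption of nblocks full blocks (inherently serial). */
static void cbc_encrypt(const uint8_t *key, const uint8_t *iv,
                        const uint8_t *src, uint8_t *dst, size_t nblocks)
{
    const uint8_t *prev = iv;

    for (size_t b = 0; b < nblocks; b++) {
        uint8_t x[BLK];

        for (int i = 0; i < BLK; i++)
            x[i] = src[b * BLK + i] ^ prev[i];
        toy_encrypt(key, x, &dst[b * BLK]);
        prev = &dst[b * BLK];
    }
}

/* CS3 stores the last two CBC ciphertext blocks swapped (if there are two). */
static void swap_last_two(uint8_t *buf, size_t nblocks)
{
    uint8_t tmp[BLK];

    if (nblocks < 2)
        return;
    memcpy(tmp, buf + (nblocks - 2) * BLK, BLK);
    memcpy(buf + (nblocks - 2) * BLK, buf + (nblocks - 1) * BLK, BLK);
    memcpy(buf + (nblocks - 1) * BLK, tmp, BLK);
}

/* Map a CBC block position to where CS3 stores that block. */
static size_t cbc_to_cs3(size_t b, size_t nblocks)
{
    if (nblocks >= 2 && b == nblocks - 2)
        return nblocks - 1;
    if (nblocks >= 2 && b == nblocks - 1)
        return nblocks - 2;
    return b;
}

/* Decrypt a block-aligned CS3 message out of place (src != dst). */
static void cs3_decrypt_aligned(const uint8_t *key, const uint8_t *iv,
                                const uint8_t *src, uint8_t *dst,
                                size_t nblocks)
{
    /* Pass 1: block decryptions; no iteration depends on another. */
    for (size_t b = 0; b < nblocks; b++)
        toy_decrypt(key, &src[cbc_to_cs3(b, nblocks) * BLK], &dst[b * BLK]);

    /* Pass 2: XOR each result with the previous CBC ciphertext block. */
    for (size_t b = 0; b < nblocks; b++) {
        const uint8_t *prev =
            b == 0 ? iv : &src[cbc_to_cs3(b - 1, nblocks) * BLK];

        for (int i = 0; i < BLK; i++)
            dst[b * BLK + i] ^= prev[i];
    }
}
```

Pass 1 is the part a vector implementation can run across all blocks at once.
When the final block is partial, the second-to-last decryption needs bytes
from the result of the other one, so that pair stays serial; this sketch does
not cover that case.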

I'd like to try the same optimization for arm64 and x86.  It's not fun going
back to SIMD after working with the RISC-V Vector Extension, though!

- Eric