[PATCH v6 5/6] crypto: arm64/aes-ccm - reduce NEON begin/end calls for common case

Wed May 26 10:14:08 PDT 2021

On Wed, May 26, 2021 at 12:07:28PM +0200, Ard Biesheuvel wrote:
> AES-CCM (as used in WPA2 CCMP, for instance) typically involves
> authenticate-only data, and operates on a single network packet, and so
> the common case is for the authenticate, en/decrypt and finalize SIMD
> helpers to all be called exactly once in sequence. Since
> kernel_neon_end() now involves manipulation of the preemption state as
> well as the softirq mask state, let's reduce the number of times we are
> forced to call it to only once if we are handling this common case.
> 
> Signed-off-by: Ard Biesheuvel <ardb at kernel.org>
> ---
>  arch/arm64/crypto/aes-ce-ccm-core.S |  1 +
>  arch/arm64/crypto/aes-ce-ccm-glue.c | 74 +++++++++++---------
>  2 files changed, 43 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
> index 99a028e298ed..8adff299fcd3 100644
> --- a/arch/arm64/crypto/aes-ce-ccm-core.S
> +++ b/arch/arm64/crypto/aes-ce-ccm-core.S
> @@ -124,6 +124,7 @@ SYM_FUNC_START(ce_aes_ccm_final)
>  SYM_FUNC_END(ce_aes_ccm_final)
>  
>  	.macro	aes_ccm_do_crypt,enc
> +	cbz	x2, 5f
>  	ldr	x8, [x6, #8]			/* load lower ctr */
>  	ld1	{v0.16b}, [x5]			/* load mac */
>  CPU_LE(	rev	x8, x8			)	/* keep swabbed ctr in reg */
> diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
> index 54bd2494a000..98159f2c49ae 100644
> --- a/arch/arm64/crypto/aes-ce-ccm-glue.c
> +++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
> @@ -97,10 +97,8 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
>  static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
>  			   u32 abytes, u32 *macp)
>  {
> -	kernel_neon_begin();
>  	ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
>  			     num_rounds(key));
> -	kernel_neon_end();
>  }
[...]
> +	if (req->assoclen)
> +		ccm_calculate_auth_mac(req, mac);
> +

This still processes all the associated data under a single
kernel_neon_begin() / kernel_neon_end() pair, no matter how much of it there
is.  Since preemption is disabled for that whole time, shouldn't each section
be limited to a reasonable amount of data, like 4K?  This sort of thing has
been considered a bug before, e.g. see
commit 706024a52c6 ("crypto: arch/lib - limit simd usage to 4k chunks").

You could do the entire CCM operation under a single pair as long as there isn't
more than 4K of associated data.
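
To illustrate the kind of chunking I mean, here is a rough userspace sketch,
not the actual glue code.  kernel_neon_begin() / kernel_neon_end() are stubbed
out as counters, and fake_auth_data(), auth_chunked() and SIMD_CHUNK are
made-up names standing in for ce_aes_ccm_auth_data() and its caller:

```c
#include <stddef.h>
#include <stdint.h>

#define SIMD_CHUNK 4096	/* assumed per-section limit, as in the 4K suggestion */

/* Userspace stand-ins for the kernel helpers; they only keep statistics. */
static unsigned int neon_sections;	/* number of begin/end pairs seen */
static size_t bytes_in_section;		/* bytes handled in the current section */
static size_t max_bytes_in_section;	/* largest section seen so far */

static void kernel_neon_begin(void)
{
	neon_sections++;
	bytes_in_section = 0;
}

static void kernel_neon_end(void)
{
	if (bytes_in_section > max_bytes_in_section)
		max_bytes_in_section = bytes_in_section;
}

/*
 * Hypothetical placeholder for ce_aes_ccm_auth_data(): it just XOR-folds the
 * input into the 16-byte MAC buffer.  This is not CBC-MAC; it only exists so
 * that the chunking loop has something to call.
 */
static void fake_auth_data(uint8_t mac[16], const uint8_t *in, size_t len,
			   uint32_t *macp)
{
	for (size_t i = 0; i < len; i++) {
		mac[*macp] ^= in[i];
		*macp = (*macp + 1) % 16;
	}
	bytes_in_section += len;
}

/* Process at most SIMD_CHUNK bytes per begin/end pair. */
static void auth_chunked(uint8_t mac[16], const uint8_t *in, size_t len,
			 uint32_t *macp)
{
	while (len) {
		size_t n = len < SIMD_CHUNK ? len : SIMD_CHUNK;

		kernel_neon_begin();
		fake_auth_data(mac, in, n, macp);
		kernel_neon_end();

		in += n;
		len -= n;
	}
}
```

With this shape, any input of 4K or less is still handled in a single section,
so the common single-packet case keeps the benefit of the patch, and only
larger inputs pay for extra begin/end pairs.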

- Eric