[PATCH v2] crypto: arm/chacha-neon - optimize for non-block size multiples

Eric Biggers ebiggers at kernel.org
Sat Dec 12 01:43:33 EST 2020


Hi Ard,

On Tue, Nov 03, 2020 at 05:28:09PM +0100, Ard Biesheuvel wrote:
> @@ -42,24 +42,24 @@ static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
>  {
>  	u8 buf[CHACHA_BLOCK_SIZE];
>  
> -	while (bytes >= CHACHA_BLOCK_SIZE * 4) {
> -		chacha_4block_xor_neon(state, dst, src, nrounds);
> -		bytes -= CHACHA_BLOCK_SIZE * 4;
> -		src += CHACHA_BLOCK_SIZE * 4;
> -		dst += CHACHA_BLOCK_SIZE * 4;
> -		state[12] += 4;
> -	}
> -	while (bytes >= CHACHA_BLOCK_SIZE) {
> -		chacha_block_xor_neon(state, dst, src, nrounds);
> -		bytes -= CHACHA_BLOCK_SIZE;
> -		src += CHACHA_BLOCK_SIZE;
> -		dst += CHACHA_BLOCK_SIZE;
> -		state[12]++;
> +	while (bytes > CHACHA_BLOCK_SIZE) {
> +		unsigned int l = min(bytes, CHACHA_BLOCK_SIZE * 4U);
> +
> +		chacha_4block_xor_neon(state, dst, src, nrounds, l);
> +		bytes -= l;
> +		src += l;
> +		dst += l;
> +		state[12] += DIV_ROUND_UP(l, CHACHA_BLOCK_SIZE);
>  	}
>  	if (bytes) {
> -		memcpy(buf, src, bytes);
> -		chacha_block_xor_neon(state, buf, buf, nrounds);
> -		memcpy(dst, buf, bytes);
> +		const u8 *s = src;
> +		u8 *d = dst;
> +
> +		if (bytes != CHACHA_BLOCK_SIZE)
> +			s = d = memcpy(buf, src, bytes);
> +		chacha_block_xor_neon(state, d, s, nrounds);
> +		if (d != dst)
> +			memcpy(dst, buf, bytes);
>  	}
>  }
>  

Shouldn't this branch still increment the block counter (state[12]) after
chacha_block_xor_neon()?  It might be needed by the library API.
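
To illustrate the counter bookkeeping, here is a rough userspace model of the
loop structure quoted above with the tail increment added.  It is only a sketch:
stub_4block_xor(), stub_block_xor(), model_doneon() and blocks_consumed() are
hypothetical stand-ins that copy data instead of running ChaCha, so it shows
only how state[12] advances, not the cipher itself.

```c
#include <stdint.h>
#include <string.h>

#define CHACHA_BLOCK_SIZE 64u
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical stand-in for chacha_4block_xor_neon(): copies, no cipher. */
static void stub_4block_xor(uint32_t *state, uint8_t *dst,
			    const uint8_t *src, unsigned int bytes)
{
	(void)state;
	memmove(dst, src, bytes);
}

/* Hypothetical stand-in for chacha_block_xor_neon(): always one full block. */
static void stub_block_xor(uint32_t *state, uint8_t *dst, const uint8_t *src)
{
	(void)state;
	memmove(dst, src, CHACHA_BLOCK_SIZE);
}

/* Same control flow as the patched chacha_doneon(), plus the tail increment. */
static void model_doneon(uint32_t *state, uint8_t *dst, const uint8_t *src,
			 unsigned int bytes)
{
	uint8_t buf[CHACHA_BLOCK_SIZE] = { 0 };	/* zeroed only for this model */

	while (bytes > CHACHA_BLOCK_SIZE) {
		unsigned int l = bytes < CHACHA_BLOCK_SIZE * 4u ?
				 bytes : CHACHA_BLOCK_SIZE * 4u;

		stub_4block_xor(state, dst, src, l);
		bytes -= l;
		src += l;
		dst += l;
		state[12] += DIV_ROUND_UP(l, CHACHA_BLOCK_SIZE);
	}
	if (bytes) {
		const uint8_t *s = src;
		uint8_t *d = dst;

		/* Partial block: bounce through buf so we never overrun dst. */
		if (bytes != CHACHA_BLOCK_SIZE)
			s = d = memcpy(buf, src, bytes);
		stub_block_xor(state, d, s);
		if (d != dst)
			memcpy(dst, buf, bytes);
		state[12]++;	/* the increment the quoted hunk does not have */
	}
}

/* Returns how far the counter advanced; bytes must not exceed 1024. */
static uint32_t blocks_consumed(unsigned int bytes)
{
	static uint8_t in[1024], out[1024];
	uint32_t state[16] = { 0 };

	model_doneon(state, out, in, bytes);
	return state[12];
}
```

With the increment in place, the counter in this model always ends up at
DIV_ROUND_UP(bytes, CHACHA_BLOCK_SIZE).  Without it, any length that leaves a
final block of 1 to 64 bytes for the tail branch (e.g. 64 or 320 bytes) would
leave the counter one short.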

Also, even with that fixed, this patch causes the self-tests to fail when I
boot a kernel in QEMU: both chacha20poly1305_selftest() and the crypto API
tests for chacha20-neon, xchacha20-neon, and xchacha12-neon.  This doesn't
happen on real hardware (Raspberry Pi 2), and I don't see any other bugs in
this patch, so I'm not sure what the problem is.  Did you run the self-tests
on every platform you tested this on?

- Eric


