[PATCH 2/9] lib/crypto: polyval: Add POLYVAL library

Eric Biggers ebiggers at kernel.org
Tue Nov 11 11:46:03 PST 2025


On Tue, Nov 11, 2025 at 08:42:29AM +0100, Ard Biesheuvel wrote:
> On Mon, 10 Nov 2025 at 16:21, Ard Biesheuvel <ardb at kernel.org> wrote:
> >
> > Hi,
> >
> > On Mon, 10 Nov 2025 at 00:49, Eric Biggers <ebiggers at kernel.org> wrote:
> > >
> > > Add support for POLYVAL to lib/crypto/.
> > >
> > > This will replace the polyval crypto_shash algorithm and its use in the
> > > hctr2 template, simplifying the code and reducing overhead.
> > >
> > > Specifically, this commit introduces the POLYVAL library API and a
> > > generic implementation of it.  Later commits will migrate the existing
> > > architecture-optimized implementations of POLYVAL into lib/crypto/ and
> > > add a KUnit test suite.
> > >
> > > I've also rewritten the generic implementation completely, using a more
> > > modern approach instead of the traditional table-based approach.  It's
> > > now constant-time, requires no precomputation or dynamic memory
> > > allocations, decreases the per-key memory usage from 4096 bytes to 16
> > > bytes, and is faster than the old polyval-generic even on bulk data
> > > reusing the same key (at least on x86_64, where I measured 15% faster).
> > > We should do this for GHASH too, but for now just do it for POLYVAL.
> > >
> >
> > Very nice.
> >
> > GHASH might suffer on 32-bit, I suppose, but taking this approach for
> > GHASH too, at least on 64-bit, would be a huge improvement.
> >
> > I had a stab at replacing the int128 arithmetic with
> > __builtin_bitreverse64(), but it seems to make little difference (and
> > GCC does not support it [yet]). I've tried both arm64 and x86, and the
> > perf delta (using your kunit benchmark) is negligible in either case.
> 
> Sigh. I intended to only apply the generic patch and the kunit test,
> but ended up applying the whole series, which explains perfectly why
> the performance was unchanged on both x86_64 and arm64, given that the
> generic code isn't even being used.
> 
> So trying this again, on a Cortex-A72 without Crypto Extensions, I do
> get a ~30% performance improvement doing the below. I haven't
> re-tested x86, but given that it does not appear to have a native
> scalar bit reverse instruction (or __builtin_bitreverse64() is broken
> for it), there is probably no point in finding out.
> 
> Not saying we should do this for POLYVAL, but something to keep in
> mind for gf128mul.c perhaps.
> 
> 
> --- a/lib/crypto/polyval.c
> +++ b/lib/crypto/polyval.c
> @@ -42,11 +42,48 @@
>   * 256-bit => 128-bit reduction algorithm.
>   */
> 
> -#ifdef CONFIG_ARCH_SUPPORTS_INT128
> +#if defined(CONFIG_ARCH_SUPPORTS_INT128) || __has_builtin(__builtin_bitreverse64)
> 
>  /* Do a 64 x 64 => 128 bit carryless multiplication. */
>  static void clmul64(u64 a, u64 b, u64 *out_lo, u64 *out_hi)
>  {
> +       u64 a0 = a & 0x1111111111111111;
> +       u64 a1 = a & 0x2222222222222222;
> +       u64 a2 = a & 0x4444444444444444;
> +       u64 a3 = a & 0x8888888888888888;
> +
> +       u64 b0 = b & 0x1111111111111111;
> +       u64 b1 = b & 0x2222222222222222;
> +       u64 b2 = b & 0x4444444444444444;
> +       u64 b3 = b & 0x8888888888888888;
> +
> +#if __has_builtin(__builtin_bitreverse64)
> +#define brev64 __builtin_bitreverse64
> +       u64 c0 = (a0 * b0) ^ (a1 * b3) ^ (a2 * b2) ^ (a3 * b1);
> +       u64 c1 = (a0 * b1) ^ (a1 * b0) ^ (a2 * b3) ^ (a3 * b2);
> +       u64 c2 = (a0 * b2) ^ (a1 * b1) ^ (a2 * b0) ^ (a3 * b3);
> +       u64 c3 = (a0 * b3) ^ (a1 * b2) ^ (a2 * b1) ^ (a3 * b0);
> +
> +       a0 = brev64(a0);
> +       a1 = brev64(a1);
> +       a2 = brev64(a2);
> +       a3 = brev64(a3);
> +
> +       b0 = brev64(b0);
> +       b1 = brev64(b1);
> +       b2 = brev64(b2);
> +       b3 = brev64(b3);
> +
> +       u64 d0 = (a0 * b0) ^ (a1 * b3) ^ (a2 * b2) ^ (a3 * b1);
> +       u64 d1 = (a0 * b1) ^ (a1 * b0) ^ (a2 * b3) ^ (a3 * b2);
> +       u64 d2 = (a0 * b2) ^ (a1 * b1) ^ (a2 * b0) ^ (a3 * b3);
> +       u64 d3 = (a0 * b3) ^ (a1 * b2) ^ (a2 * b1) ^ (a3 * b0);
> +
> +       *out_hi = ((brev64(d0) >> 1) & 0x1111111111111111) ^
> +                 ((brev64(d1) >> 1) & 0x2222222222222222) ^
> +                 ((brev64(d2) >> 1) & 0x4444444444444444) ^
> +                 ((brev64(d3) >> 1) & 0x8888888888888888);

Yeah, that's an interesting idea!  So if we bit-reflect the inputs, do
an n x n => n multiplication, and bit-reflect the output and right-shift
it by 1, we get the high half of the desired n x n => 2n multiplication.
(This relies on the fact that carries are being discarded.)  Then we
don't need an instruction that does an n x n => 2n multiplication or
produces the high half of it.

The availability of hardware bit-reversal is limited, though.  arm32,
arm64, and mips32r6 have it.  But all of those also have a "multiply
high" instruction.  So the 30% performance improvement you saw on arm64
seems surprising to me, as umulh should have been used.  (I verified
that it's indeed used in the generated asm with both gcc and clang.)
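
(For reference, a rough sketch of why umulh comes into play: with
CONFIG_ARCH_SUPPORTS_INT128, a 64 x 64 => 128-bit integer multiply can
be written with the __int128 extension, and on arm64 the compiler
lowers the low half to mul and the high half to umulh.  mul64x64_128()
is just an illustrative helper name, not code from the patch.)

#include <stdint.h>

/* 64 x 64 => 128-bit integer multiply via unsigned __int128.
 * On arm64: *lo becomes a MUL, *hi becomes a UMULH. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
        unsigned __int128 p = (unsigned __int128)a * b;

        *lo = (uint64_t)p;
        *hi = (uint64_t)(p >> 64);
}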

The available bit-reversal abstractions aren't too great either, with
__builtin_bitreverse64() being clang-specific and <linux/bitrev.h>
having a table-based, i.e. non-constant-time, fallback.  So presumably
we'd need to add our own helper that is guaranteed to use the actual
instructions rather than a slow and/or table-based fallback.
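
(A rough idea of what such a helper could look like; bitrev64_ct() is
a hypothetical name, not an existing kernel API.  It uses the builtin
where available, and otherwise falls back to a constant-time, though
slower, mask-and-swap sequence rather than a lookup table; a real
version would presumably want inline asm for rbit under gcc on
arm/arm64.)

#include <stdint.h>

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Constant-time 64-bit bit reversal. */
static inline uint64_t bitrev64_ct(uint64_t x)
{
#if __has_builtin(__builtin_bitreverse64)
        return __builtin_bitreverse64(x);       /* single rbit on arm64 with clang */
#else
        /* Swap bits, bit-pairs, and nibbles, then reverse the bytes. */
        x = ((x & 0x5555555555555555ULL) << 1) | ((x >> 1) & 0x5555555555555555ULL);
        x = ((x & 0x3333333333333333ULL) << 2) | ((x >> 2) & 0x3333333333333333ULL);
        x = ((x & 0x0f0f0f0f0f0f0f0fULL) << 4) | ((x >> 4) & 0x0f0f0f0f0f0f0f0fULL);
        return __builtin_bswap64(x);
#endif
}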

I'll definitely look into this more later when bringing this improvement
to GHASH too.  But for now I think we should go with the version I have
in my patch.

- Eric


