[PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

Tue Jun 11 13:18:58 PDT 2024

On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
> 
> BTW, I found an old Intel paper that claims through their multi-
> buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
> I wonder if we could still replicate this today:
> 
> https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf

No, not even close.  Even assuming that the lack of parallelizability in AES-CBC
and AES-XCBC can be entirely compensated for via multibuffer crypto (which it
really can't -- consider single packets, for example), doing AES twice is
much more expensive than doing AES and GHASH.  GHASH is a universal hash
function, and computing a universal hash function is inherently cheaper than
computing a cryptographic hash function.  Modern Intel CPUs also have very fast
carryless multiplication, which runs on a different execution port from the AES
instructions.  So the overhead of AES + GHASH over AES alone is very small.  By
doing AES twice, you'd be entirely bottlenecked by the ports that can execute
the AES instructions, while the other ports go nearly unused.  So it would
probably be close to twice as slow as AES-GCM.
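
For readers unfamiliar with the operation, here is a minimal portable sketch of
what a single pclmulqdq computes: a 64x64 -> 128-bit carryless (polynomial over
GF(2)) multiplication.  GHASH builds its GF(2^128) multiplication per 16-byte
block out of a few of these plus a reduction step, which is not shown here.
The names clmul64 and struct clmul128 are hypothetical and for illustration
only; this is not the kernel's implementation, and it is neither constant-time
nor fast.

```c
#include <stdint.h>

/* 128-bit carryless product, split into two 64-bit halves (hypothetical type). */
struct clmul128 {
	uint64_t lo;
	uint64_t hi;
};

/*
 * Bit-by-bit carryless multiplication of two 64-bit values.  For each set bit
 * i of b, XOR in (a << i); XOR replaces addition, so no carries propagate.
 * This is the operation one pclmulqdq performs in hardware.
 */
static struct clmul128 clmul64(uint64_t a, uint64_t b)
{
	struct clmul128 r = { 0, 0 };
	int i;

	for (i = 0; i < 64; i++) {
		if ((b >> i) & 1) {
			r.lo ^= a << i;
			/* Bits shifted out of the low half land in the high half. */
			if (i != 0)
				r.hi ^= a >> (64 - i);
		}
	}
	return r;
}
```

As a sanity check, clmul64(3, 3) gives 5 rather than 9, since (x + 1)^2 is
x^2 + 1 when coefficients are reduced mod 2.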

Westmere (2010) through Ivy Bridge (2012) are the only Intel CPUs where
multibuffer AES-CBC-XCBC could plausibly be faster than AES-GCM (given a
sufficiently large number of messages at once), due to the very slow pclmulqdq
instruction on those CPUs.  That has long since been fixed: pclmulqdq became much
faster in Haswell (2013), and faster still in Broadwell.  This is exactly what
that Intel paper shows; they show AES-GCM becoming fastest in "Gen 4", i.e.
Haswell.  The paper is from 2012, so of course they don't show anything after
that.  But AES-GCM has only pulled ahead even more since then.

In theory something like AES-CBC + SHA-256 could be slightly more competitive
than AES-CBC + AES-XCBC.  But it would still be worse than simply doing AES-GCM
-- which, again, doesn't need multibuffer, and which my recent patches have
already fully optimized for recent x86_64 CPUs.

- Eric