[PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest

Doug Anderson dianders at chromium.org
Tue Sep 22 20:39:24 EDT 2020


On Mon, Sep 21, 2020 at 11:25 PM Ard Biesheuvel <ardb at kernel.org> wrote:
>
> On Tue, 22 Sep 2020 at 02:27, Douglas Anderson <dianders at chromium.org> wrote:
> >
> > On every boot we see messages like this:
> >
> > [    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
> > [    0.025363] xor: measuring software checksum speed
> > [    0.035351]    8regs     :  3952.000 MB/sec
> > [    0.045384]    32regs    :  4860.000 MB/sec
> > [    0.055418]    arm64_neon:  5900.000 MB/sec
> > [    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > [    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> >
> > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > again, the arm64_neon implementation is the fastest way to do XOR.
> > ...and the above is on a system with HZ=1000.  Due to the way the
> > testing happens, if we have HZ defined to something slower it'll take
> > much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
> > a fact that will be the same for every bootup.
> >
> > Trying to super-optimize the xor operation makes a lot of sense if
> > you're using software RAID, but the above is probably not worth it for
> > most Linux users because:
> > 1. Quite a few arm64 kernels are built for embedded systems where
> >    software RAID isn't common.  That means we're spending lots of time
> >    on every boot trying to optimize something we don't use.
> > 2. Presumably, if we have neon, it's faster than alternatives.  If
> >    it's not, it's not expected to be tons slower.
> > 3. Quite a lot of arm64 systems are big.LITTLE.  This means that the
> >    existing test is somewhat misguided because it's assuming that test
> >    results on the boot CPU apply to the other CPUs in the system.
> >    This is not necessarily the case.
> >
> > Let's add a new config option that allows us to just use the neon
> > functions (if present) without benchmarking.
> >
> > NOTE: One small side effect is that on an arm64 system _without_ neon
> > we'll end up testing the xor_block_8regs_p and xor_block_32regs_p
> > versions of the function.  That's presumably OK since we already test
> > all those when KERNEL_MODE_NEON is disabled.
> >
> > ALSO NOTE: presumably the way to do better than this is to add some
> > sort of per-CPU-core lookup table and jump to a per-CPU-core-specific
> > XOR function each time xor is called.  Without seeing evidence that
> > this would really help someone, though, that doesn't seem worth it.
> >
> > Signed-off-by: Douglas Anderson <dianders at chromium.org>
>
> On the two arm64 machines that I happen to have running right now, I get
>
> SynQuacer (Cortex-A53)
>
>     8regs     :  1917.000 MB/sec
>     32regs    :  2270.000 MB/sec
>     arm64_neon:  2053.000 MB/sec
>
> ThunderX2
>
>     8regs     : 10170.000 MB/sec
>     32regs    : 12051.000 MB/sec
>     arm64_neon: 10948.000 MB/sec
>
> so your assertion is not entirely valid.

OK, good to know.
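
For scale, here is a rough sketch of what unconditionally picking
arm64_neon would cost on each of the three systems quoted in this
thread.  It uses only the MB/sec figures from the logs above; the
script itself, the `results` table and the `neon_penalty` helper are
illustrative names and not part of the patch.

```python
# Throughput figures (MB/sec) copied from the boot logs quoted in this thread.
results = {
    "first log (HZ=1000 system)": {
        "8regs": 3952.0, "32regs": 4860.0, "arm64_neon": 5900.0,
    },
    "SynQuacer (Cortex-A53)": {
        "8regs": 1917.0, "32regs": 2270.0, "arm64_neon": 2053.0,
    },
    "ThunderX2": {
        "8regs": 10170.0, "32regs": 12051.0, "arm64_neon": 10948.0,
    },
}


def neon_penalty(speeds):
    """Fraction of the best measured throughput lost by always using arm64_neon."""
    best = max(speeds.values())
    return (best - speeds["arm64_neon"]) / best


for name, speeds in results.items():
    fastest = max(speeds, key=speeds.get)
    print(f"{name}: fastest={fastest}, NEON penalty={neon_penalty(speeds):.1%}")
```

On the SynQuacer and ThunderX2 numbers, 32regs wins and NEON gives up a
little under 10% of the best throughput; on the first log NEON is the
fastest, so the penalty is zero.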


> If the system does not need XOR, it is free not to load the module, so
> there is no reason it has to affect the boot time.

The fact that it was run super early somehow made me just assume that
this couldn't be a module, but of course you're right that it can be a
module.  That works for me and saves me my precious boot time.  ;-)

That being said, this'll still bite anyone who wants to build this in
for whatever reason.  I'll respond to your other email with more...
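
As a back-of-the-envelope sketch of the built-in cost: the timestamps
in the log above show roughly 10 ms per candidate at HZ=1000, i.e.
about 10 jiffies each.  Assuming that per-candidate cost stays at 10
jiffies regardless of HZ (an inference from the log, not something
checked against the calibration code), the total scales like this.
The constants and the `calibration_ms` helper are hypothetical names
used only for illustration.

```python
# Assumption: each candidate costs about 10 jiffies, inferred from the
# ~10 ms gaps between log lines on the HZ=1000 system.
JIFFIES_PER_CANDIDATE = 10
# 8regs, 32regs and arm64_neon were benchmarked in the quoted log.
CANDIDATES = 3


def calibration_ms(hz, candidates=CANDIDATES, jiffies_each=JIFFIES_PER_CANDIDATE):
    """Estimated wall-clock time of the XOR calibration, in milliseconds."""
    return candidates * jiffies_each * 1000 / hz


for hz in (1000, 250, 100):
    print(f"HZ={hz}: ~{calibration_ms(hz):.0f} ms")
```

This reproduces the 30 ms (HZ=1000) and 300 ms (HZ=100) figures from
the commit message; HZ=250 is included only as another common setting.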


