[PATCHv3 10/10] x86/crypto: add pclmul acceleration for crc64
David Laight
David.Laight at ACULAB.COM
Tue Feb 22 09:02:16 PST 2022
From: Keith Busch
> Sent: 22 February 2022 16:32
>
> The crc64 table lookup method is inefficient, using a significant number
> of CPU cycles in the block stack per IO. If available on x86, use a
> PCLMULQDQ implementation to accelerate the calculation.
>
> The assembly from this patch was mostly generated by gcc from a C
> program using library functions provided by x86 intrinsics, and measures
> ~20x faster than the table lookup.
I think I'd like to see the C code and compiler options used to
generate the assembler as comments in the committed source file.
Either that or reasonable comments in the assembler.
It is also quite a lot of code.
What is the break-even length for 'cold cache', including the FPU saves?
...
> +.section .rodata
> +.align 32
> +.type shuffleMasks, @object
> +.size shuffleMasks, 32
> +shuffleMasks:
> + .string ""
> + .ascii "\001\002\003\004\005\006\007\b\t\n\013\f\r\016\017\217\216\215"
> + .ascii "\214\213\212\211\210\207\206\205\204\203\202\201\200"
That has to be the worst way to define 32 bytes.
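As a rough sketch of a more readable form (not taken from the patch): the
same 32 bytes written out in hex, here as a C array. The names
shuffle_masks and shuffle_masks_match_original are made up for this
example. The helper only checks that the hex table matches the quoted
.string/.ascii bytes; an assembler .byte list would carry the same values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical rewrite of the quoted table: bytes 0x00..0x0f followed
 * by 0x8f counting down to 0x80.
 */
static const uint8_t shuffle_masks[32] = {
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
	0x8f, 0x8e, 0x8d, 0x8c, 0x8b, 0x8a, 0x89, 0x88,
	0x87, 0x86, 0x85, 0x84, 0x83, 0x82, 0x81, 0x80,
};

static_assert(sizeof(shuffle_masks) == 32, "table must be 32 bytes");

/*
 * The bytes as the patch spells them: .string "" contributes the
 * leading NUL, the two .ascii lines contribute the remaining 31 bytes.
 */
static const char original_bytes[] =
	"\000"
	"\001\002\003\004\005\006\007\b\t\n\013\f\r\016\017\217\216\215"
	"\214\213\212\211\210\207\206\205\204\203\202\201\200";

/* Returns 1 if the hex table is byte-for-byte identical to the original. */
static int shuffle_masks_match_original(void)
{
	return sizeof(original_bytes) - 1 == sizeof(shuffle_masks) &&
	       memcmp(shuffle_masks, original_bytes,
		      sizeof(shuffle_masks)) == 0;
}
```

Written this way the structure is visible at a glance: an ascending run
of 16 bytes, then a descending run of 16 bytes with the top bit set.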
> +.section .rodata.cst16,"aM",@progbits,16
> +.align 16
> +.LC0:
> + .quad -1523270018343381984
> + .quad 2443614144669557164
> + .align 16
> +.LC1:
> + .quad 2876949357237608311
> + .quad 3808117099328934763
Not sure what those are, but I bet there are better ways to
define/describe them.
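One possible step, sketched here as an assumption rather than anything
from the patch: print the four .quad values as unsigned 64-bit hex so
they can be compared against whatever reference the constants were
derived from. The names quads, quad_bits and format_quad_hex are
hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The four .quad values from .LC0 and .LC1, exactly as quoted. */
static const int64_t quads[4] = {
	-1523270018343381984LL, 2443614144669557164LL,
	2876949357237608311LL, 3808117099328934763LL,
};

static_assert(sizeof(quads) == 32, "two 16-byte constants expected");

/* Reinterpret a signed .quad value as the 64-bit pattern it encodes. */
static uint64_t quad_bits(int64_t v)
{
	return (uint64_t)v;
}

/* Format one value as "0x" plus 16 hex digits; out needs 19 bytes. */
static void format_quad_hex(int64_t v, char out[19])
{
	snprintf(out, 19, "0x%016" PRIx64, quad_bits(v));
}

/* Print all four constants, one per line, in hex. */
static void print_quads_hex(void)
{
	char buf[19];

	for (size_t i = 0; i < sizeof(quads) / sizeof(quads[0]); i++) {
		format_quad_hex(quads[i], buf);
		printf("quad[%zu] = %s\n", i, buf);
	}
}
```

Calling print_quads_hex() from a small test program gives values that
could go into the source as hex .quad operands with a comment saying
what each one is.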
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)