[PATCH] riscv: Optimize crc32 with Zbc extension

Wang, Xiao W xiao.w.wang at intel.com
Tue Jan 23 21:44:52 PST 2024


Hi Andrew,

> -----Original Message-----
> From: Andrew Jones <ajones at ventanamicro.com>
> Sent: Tuesday, January 16, 2024 11:04 PM
> To: Wang, Xiao W <xiao.w.wang at intel.com>
> Cc: paul.walmsley at sifive.com; palmer at dabbelt.com;
> aou at eecs.berkeley.edu; conor.dooley at microchip.com; heiko at sntech.de; Li,
> Haicheng <haicheng.li at intel.com>; linux-riscv at lists.infradead.org; linux-
> kernel at vger.kernel.org
> Subject: Re: [PATCH] riscv: Optimize crc32 with Zbc extension
> 
> On Fri, Jan 05, 2024 at 04:08:30PM +0800, Xiao Wang wrote:
> > As suggested by the B-ext spec, the Zbc (carry-less multiplication)
> > instructions can be used to accelerate CRC calculations. Currently,
> > crc32 is the most widely used CRC function inside the kernel, so this
> > patch focuses on optimizing just the crc32 APIs.
> >
> > Compared with the current table-lookup based optimization, the Zbc
> > based optimization can also achieve a large stride in the CRC
> > calculation loop; meanwhile, it avoids the memory access latency of
> > the table lookup and reduces the memory footprint.
> >
> > If the Zbc extension is not supported in the runtime environment, the
> > table-lookup based implementation serves as the fallback via the
> > alternative mechanism. To avoid performance concerns around unaligned
> > access, the fallback implementation also handles the head and tail
> > bytes, if any, while the main body is accelerated by the Zbc extension.
> >
> > This patch is tested on QEMU VM with the kernel CRC32 selftest.
> 
> Ideally we'd also have the results of some benchmarks, or at least a
> dynamic instruction count or something, but I don't suspect that the
> table-lookup would "win" over Zbc. The bigger question is whether we
> want three implementations, because I assume we want Zvbc, particularly
> since [1] says
> 
> """
> Zvbc is listed as a development option for use in other algorithms,
> and will become mandatory. Scalar Zbc is now listed as an expansion
> option, i.e., it will probably not become mandatory.
> """
> 
> [1] https://github.com/riscv/riscv-profiles/blob/main/rva23-profile.adoc

I will collect the dynamic instruction count change and include the info in the
commit log. Regarding the bigger question, I have some thoughts:

- It looks like this question also applies to other cases such as lib/str*, which
are currently accelerated by the Zbb extension; since in-kernel RVV support has
been enabled, we could also use RVV to accelerate lib/str*.
The ALTERNATIVE_2 mechanism allows a second code patching at the same location,
so two alternative versions of the same function can be supported; see the rough
sketch below.
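
For illustration only, the dispatch at one call site could then look roughly
like this (RISCV_ISA_EXT_ZVBC and the label names are placeholders for a
hypothetical vector path, and the ordering assumes the alternative patched
later takes effect when both extensions are present):

	asm_volatile_goto(ALTERNATIVE_2("j %l[legacy]",
					"nop",          0, RISCV_ISA_EXT_ZBC,  1,
					"j %l[vector]", 0, RISCV_ISA_EXT_ZVBC, 1)
			  : : : : legacy, vector);

	/* Zbc path falls through here. */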

- At the moment, I can't see how Zvbc could better accelerate these crc32 APIs,
which take just a single data stream as input. I have a rough idea for a batched
API that takes multiple data streams as input (a parallel implementation of the
Zbc acceleration below), but not for the three APIs here.
BTW, the Zvbc spec does not list a vector "clmulr" insn, even though the scalar
"clmulr" is useful in the implementation below.
https://github.com/riscv/riscv-crypto/blob/main/doc/vector/riscv-crypto-vector-zvbc.adoc
Anyway, if we come up with a Zvbc based optimization in the future, ALTERNATIVE_2
would help.

> 
> 
> A few nits below.

Thanks for all the above & below comments. I will adopt most of the suggestions
except two; please see below.

> 
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang at intel.com>
> > ---
> >  arch/riscv/Kconfig      |  23 +++++
> >  arch/riscv/lib/Makefile |   1 +
> >  arch/riscv/lib/crc32.c  | 216 ++++++++++++++++++++++++++++++++++++++++
> >  include/linux/crc32.h   |   3 +
> >  4 files changed, 243 insertions(+)
> >  create mode 100644 arch/riscv/lib/crc32.c
> >
[...]

> > +#if (BITS_PER_LONG == 64)
> > +					  size_t len, u32 poly, u64 poly_qt,
> > +#else
> > +					  size_t len, u32 poly, u32 poly_qt,
> > +#endif
> 
> How about creating a new type for poly_qt, defined as u64 for xlen=64
> and u32 for xlen=32 to avoid the #ifdef?

I would make it "unsigned long", per David's comment.
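
That is, roughly (a sketch only, with the parameter list reconstructed from the
callers quoted below):

	/* poly_qt as unsigned long: 64-bit on rv64, 32-bit on rv32, no #ifdef. */
	static inline u32 crc32_le_generic(u32 crc, unsigned char const *p,
					   size_t len, u32 poly,
					   unsigned long poly_qt, fallback crc_fb);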

> 
> > +					  fallback crc_fb)
> > +{
> > +	size_t offset, head_len, tail_len;
> > +	const unsigned long *p_ul;
> > +	unsigned long s;
> > +
> > +	asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > +				      RISCV_ISA_EXT_ZBC, 1)
> > +			  : : : : legacy);
> > +
> > +	/* Handle the unalignment head. */
> > +	offset = (unsigned long)p & OFFSET_MASK;
> > +	if (offset) {
> > +		head_len = MIN(STEP - offset, len);
> > +		crc = crc_fb(crc, p, head_len);
> > +		len -= head_len;
> > +		p += head_len;
> > +	}
> > +
> > +	tail_len = len & OFFSET_MASK;
> > +	len = len >> STEP_ORDER;
> > +	p_ul = (unsigned long *)p;
> > +
> > +	for (int i = 0; i < len; i++) {
> > +#if (BITS_PER_LONG == 64)
> > +		s = (unsigned long)crc ^ __cpu_to_le64(*p_ul++);
> > +		/* We don't have a "clmulrh" insn, so use clmul + slli instead.
> > +		 */
> 
> need opening /* comment wing
> 
> > +		asm volatile (".option push\n"
> > +			      ".option arch,+zbc\n"
> > +			      "clmul	%0, %1, %2\n"
> > +			      "slli	%0, %0, 1\n"
> > +			      "xor	%0, %0, %1\n"
> > +			      "clmulr	%0, %0, %3\n"
> > +			      "srli	%0, %0, 32\n"
> > +			      ".option pop\n"
> > +			      : "=&r" (crc)
> > +			      : "r" (s),
> > +				"r" (poly_qt),
> > +				"r" ((u64)poly << 32)
> > +			      :);
> > +#else
> > +		s = crc ^ __cpu_to_le32(*p_ul++);
> > +		/* We don't have a "clmulrh" insn, so use clmul + slli instead.
> 
> also here
> 
> > +		 */
> > +		asm volatile (".option push\n"
> > +			      ".option arch,+zbc\n"
> > +			      "clmul	%0, %1, %2\n"
> > +			      "slli	%0, %0, 1\n"
> > +			      "xor	%0, %0, %1\n"
> > +			      "clmulr	%0, %0, %3\n"
> > +			      ".option pop\n"
> > +			      : "=&r" (crc)
> > +			      : "r" (s),
> > +				"r" (poly_qt),
> > +				"r" (poly)
> > +			      :);
> > +#endif
> 
> Instead of the #ifdef here, we could provide function wrappers for the asm
> which would be defined above in the first #ifdef ladder.
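
For reference, a wrapper along these lines could keep the #ifdef out of the loop
(a sketch only; the helper name crc32_le_zbc is illustrative):

#if (BITS_PER_LONG == 64)
static inline u32 crc32_le_zbc(u32 crc, unsigned long data, u32 poly,
			       unsigned long poly_qt)
{
	unsigned long s = (unsigned long)crc ^ __cpu_to_le64(data);

	/* clmul + slli stands in for the missing "clmulrh" insn. */
	asm volatile (".option push\n"
		      ".option arch,+zbc\n"
		      "clmul	%0, %1, %2\n"
		      "slli	%0, %0, 1\n"
		      "xor	%0, %0, %1\n"
		      "clmulr	%0, %0, %3\n"
		      "srli	%0, %0, 32\n"
		      ".option pop\n"
		      : "=&r" (crc)
		      : "r" (s), "r" (poly_qt), "r" ((u64)poly << 32)
		      :);
	return crc;
}
#else
static inline u32 crc32_le_zbc(u32 crc, unsigned long data, u32 poly,
			       unsigned long poly_qt)
{
	unsigned long s = crc ^ __cpu_to_le32(data);

	asm volatile (".option push\n"
		      ".option arch,+zbc\n"
		      "clmul	%0, %1, %2\n"
		      "slli	%0, %0, 1\n"
		      "xor	%0, %0, %1\n"
		      "clmulr	%0, %0, %3\n"
		      ".option pop\n"
		      : "=&r" (crc)
		      : "r" (s), "r" (poly_qt), "r" (poly)
		      :);
	return crc;
}
#endif

with the loop body reducing to something like:

	for (int i = 0; i < len; i++)
		crc = crc32_le_zbc(crc, *p_ul++, poly, poly_qt);
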
> 
> > +	}
> > +
> > +	/* Handle the tail bytes. */
> > +	if (tail_len)
> > +		crc = crc_fb(crc, (unsigned char const *)p_ul, tail_len);
> > +	return crc;
> > +
> > +legacy:
> > +	return crc_fb(crc, p, len);
> > +}
> > +
> > +u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
> > +{
> > +	return crc32_le_generic(crc, p, len, CRC32_POLY_LE,
> CRC32_POLY_QT_LE,
> > +				crc32_le_base);
> > +}
> > +
> > +u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len)
> > +{
> > +	return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
> > +				CRC32C_POLY_QT_LE, __crc32c_le_base);
> > +}
> > +
> > +u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
> > +{
> > +	size_t offset, head_len, tail_len;
> > +	const unsigned long *p_ul;
> > +	unsigned long s;
> > +
> > +	asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > +				      RISCV_ISA_EXT_ZBC, 1)
> > +			  : : : : legacy);
> > +
> > +	/* Handle the unalignment head. */
> > +	offset = (unsigned long)p & OFFSET_MASK;
> > +	if (offset) {
> > +		head_len = MIN(STEP - offset, len);
> > +		crc = crc32_be_base(crc, p, head_len);
> > +		len -= head_len;
> > +		p += head_len;
> > +	}
> > +
> > +	tail_len = len & OFFSET_MASK;
> > +	len = len >> STEP_ORDER;
> > +	p_ul = (unsigned long *)p;
> > +
> > +	for (int i = 0; i < len; i++) {
> > +#if (BITS_PER_LONG == 64)
> > +		s = (unsigned long)crc << 32;
> > +		s ^= __cpu_to_be64(*p_ul++);
> > +#else
> > +		s = crc ^ __cpu_to_be32(*p_ul++);
> > +#endif
> 
> Could write the above without #ifdef with
> 
>  if (IS_ENABLED(CONFIG_64BIT)) {
>     ...
>  } else {
>     ...
>  }

I got a warning for the riscv32 build with this change (gcc v12.2.0):
../arch/riscv/lib/crc32.c:191:40: warning: left shift count >= width of type [-Wshift-count-overflow]
It looks like the compiler still parses and warns about the never-taken branch.
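
For context, a minimal illustration of the pattern that triggers it: on rv32,
unsigned long is 32 bits wide, so the shift by 32 in the dead branch still
warns even though the branch can never be taken:

	if (IS_ENABLED(CONFIG_64BIT)) {
		/* Dead on rv32, but still compiled: the 32-bit shift
		 * by 32 triggers -Wshift-count-overflow. */
		s = (unsigned long)crc << 32;
		s ^= __cpu_to_be64(*p_ul++);
	} else {
		s = crc ^ __cpu_to_be32(*p_ul++);
	}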

BRs,
Xiao



