[PATCH v3] riscv: Optimize crc32 with Zbc extension

Wed Mar 13 19:51:42 PDT 2024

On Thu, Mar 14, 2024 at 02:32:57AM +0000, Wang, Xiao W wrote:
> 
> 
> > -----Original Message-----
> > From: Charlie Jenkins <charlie at rivosinc.com>
> > Sent: Thursday, March 14, 2024 6:47 AM
> > To: Wang, Xiao W <xiao.w.wang at intel.com>
> > Cc: paul.walmsley at sifive.com; palmer at dabbelt.com;
> > aou at eecs.berkeley.edu; ajones at ventanamicro.com;
> > conor.dooley at microchip.com; heiko at sntech.de; david.laight at aculab.com;
> > Li, Haicheng <haicheng.li at intel.com>; linux-riscv at lists.infradead.org;
> > linux-kernel at vger.kernel.org
> > Subject: Re: [PATCH v3] riscv: Optimize crc32 with Zbc extension
> > 
> > On Wed, Mar 13, 2024 at 11:21:39AM +0800, Xiao Wang wrote:
> > > As suggested by the B-ext spec, the Zbc (carry-less multiplication)
> > > instructions can be used to accelerate CRC calculations. Currently,
> > > crc32 is the most widely used CRC function inside the kernel, so this
> > > patch focuses on optimizing just the crc32 APIs.
> > >
> > > Compared with the current table-lookup based optimization, the Zbc based
> > > optimization can also achieve a large stride in the CRC calculation loop.
> > > Meanwhile, it avoids the memory access latency of the table-lookup based
> > > implementation and reduces the memory footprint.
> > >
> > > If the Zbc feature is not supported in a runtime environment, the
> > > table-lookup based implementation serves as a fallback via the
> > > alternative mechanism.
> > >
> > > By inspecting the vmlinux built by gcc v12.2.0 with the default
> > > optimization level (-O2), we can see the following instruction count
> > > change for each 8-byte stride in the CRC32 loop:
> > >
> > > rv64: crc32_be (54->31), crc32_le (54->13), __crc32c_le (54->13)
> > > rv32: crc32_be (50->32), crc32_le (50->16), __crc32c_le (50->16)
> > 
> > Even though this loop is optimized, there are a lot of other
> > instructions being executed elsewhere for these tests. When running the
> > test case in QEMU with ZBC enabled, I get these results:
> > 
> > [    0.353444] crc32: CRC_LE_BITS = 64, CRC_BE BITS = 64
> > [    0.353470] crc32: self tests passed, processed 225944 bytes in 2044700 nsec
> > [    0.354098] crc32c: CRC_LE_BITS = 64
> > [    0.354114] crc32c: self tests passed, processed 112972 bytes in 289000 nsec
> > [    0.387204] crc32_combine: 8373 self tests passed
> > [    0.419881] crc32c_combine: 8373 self tests passed
> > 
> > Then when running with ZBC disabled I get:
> > 
> > [    0.351331] crc32: CRC_LE_BITS = 64, CRC_BE BITS = 64
> > [    0.351359] crc32: self tests passed, processed 225944 bytes in 567500 nsec
> > [    0.352071] crc32c: CRC_LE_BITS = 64
> > [    0.352090] crc32c: self tests passed, processed 112972 bytes in 289900 nsec
> > [    0.385395] crc32_combine: 8373 self tests passed
> > [    0.418180] crc32c_combine: 8373 self tests passed
> > 
> > This is QEMU, so it's not a perfect representation of hardware, but crc32
> > being roughly 3.6 times slower with ZBC (2044700 vs. 567500 nsec) seems
> > suspicious. I ran these tests numerous times and got similar results. Do
> > you know why these tests would perform that much better without ZBC?
> 
> The ZBC instructions' functionality is relatively complex, so QEMU TCG uses
> the helper function mechanism to emulate them. A helper function gets called
> for each ZBC instruction within the TCG JIT code, which is inefficient. I see
> a similar issue with the Vector extension: the optimized RVV implementation
> actually runs much slower than the scalar implementation on QEMU TCG.

Okay, I will take your word for it :)
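
For reference, what a Zbc carry-less multiply computes can be modeled in
portable C. This is only a rough sketch of the clmul/clmulh semantics on rv64
as described in the bit-manipulation spec; the names clmul64(), clmulh64() and
clmul_selftest() are made up for illustration, and this is not QEMU's helper
code. It just shows that an emulator without a host equivalent has to do a
per-bit loop for every such instruction:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of clmul on rv64: low 64 bits of the carry-less
 * product of a and b (partial products are XORed instead of added).
 */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
	uint64_t r = 0;

	for (int i = 0; i < 64; i++)
		if ((b >> i) & 1)
			r ^= a << i;
	return r;
}

/* Hypothetical model of clmulh: upper 64 bits of the same 128-bit product. */
static uint64_t clmulh64(uint64_t a, uint64_t b)
{
	uint64_t r = 0;

	/* i starts at 1: bit 0 of b contributes nothing to the high half. */
	for (int i = 1; i < 64; i++)
		if ((b >> i) & 1)
			r ^= a >> (64 - i);
	return r;
}

/* Small self-check of the model. */
static void clmul_selftest(void)
{
	/* (x + 1) * (x + 1) = x^2 + 1 over GF(2), i.e. 3 clmul 3 == 5. */
	assert(clmul64(3, 3) == 5);
	/* x^63 * x = x^64: nothing in the low half, bit 0 of the high half. */
	assert(clmul64(1ULL << 63, 2) == 0);
	assert(clmulh64(1ULL << 63, 2) == 1);
}
```

Real hardware with Zbc does this in a single instruction, which is why the
QEMU numbers above say little about the speedup on silicon.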

> 
> > 
> > >
> > > The compile target CPU is little endian, so extra effort is needed for
> > > byte swapping in the crc32_be API. Thus, the instruction count change is
> > > not as significant as in the *_le cases.
> > >
> > > This patch is tested on a QEMU VM with the kernel CRC32 selftest for both
> > > rv64 and rv32.
> > >
> > > Signed-off-by: Xiao Wang <xiao.w.wang at intel.com>
> > > ---
> > > v3:
> > > - Use Zbc to also handle the data head and tail bytes, instead of calling
> > >   the fallback function.
> > > - Misc changes due to the new design.
> > >
> > > v2:
> > > - Fix sparse warnings about type casting. (lkp)
> > > - Add info about instruction count change in commit log. (Andrew)
> > > - Use the min() helper from linux/minmax.h. (Andrew)
> > > - Use "#if __riscv_xlen == 64" macro check to differentiate rv64 and rv32.
> > >   (Andrew)
> > > - Line up several macro values by tab. (Andrew)
> > > - Make poly_qt as "unsigned long" to unify the code for rv64 and rv32.
> > >   (David)
> > > - Fix the style of comment wing. (Andrew)
> > > - Add function wrappers for the asm code for the *_le cases. (Andrew)
> > > ---
> > >  arch/riscv/Kconfig      |  23 ++++
> > >  arch/riscv/lib/Makefile |   1 +
> > >  arch/riscv/lib/crc32.c  | 294
> 
> [...]
> > > +static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
> > > +					  size_t len, u32 poly,
> > > +					  unsigned long poly_qt,
> > > +					  fallback crc_fb)
> > > +{
> > > +	size_t offset, head_len, tail_len;
> > > +	unsigned long const *p_ul;
> > > +	unsigned long s;
> > > +
> > > +	asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > 
> > This needs to be changed to use asm goto, per commit:
> > 
> > 4356e9f841f7f ("work around gcc bugs with 'asm goto' with outputs")
> > 
> 
> Thanks for the pointer. Will change it.
> 
> > > +				      RISCV_ISA_EXT_ZBC, 1)
> > > +			  : : : : legacy);
> > > +
> > > +	/* Handle the unaligned head. */
> > > +	offset = (unsigned long)p & OFFSET_MASK;
> > > +	if (offset && len) {
> > 
> > If len is 0 nothing in the function seems like it will modify crc. Is there
> > a reason to not break out immediately if len is 0?
> > 
> 
> Yeah, if len is 0, then crc won't be modified.
> Normally, in scenarios like hash value calculation and packet CRC checks,
> "len" is rarely zero. And software usually avoids unaligned buffer
> addresses, which means "offset" here is mostly zero.
> 
> So if we add a "len == 0" check at the beginning of this function, it will
> introduce branch overhead in most cases.

That makes sense, thank you.
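
To convince myself, here is a minimal userspace sketch of the same
head / aligned body / tail split. It is not the patch's code: a plain
bit-at-a-time CRC32 stands in for the Zbc helpers, the names
crc32_le_bitwise(), crc32_le_split() and crc32_split_selftest() are made up,
and the STEP/OFFSET_MASK definitions are my assumption since the patch's own
definitions are not quoted here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed definitions: one stride is one unsigned long. */
#define STEP		sizeof(unsigned long)
#define OFFSET_MASK	(STEP - 1)

/* Bit-at-a-time reflected CRC32 update; stand-in for the Zbc helpers. */
static uint32_t crc32_le_bitwise(uint32_t crc, const unsigned char *p,
				 size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
	}
	return crc;
}

/* Same control flow as the quoted function, minus the alternative jump. */
static uint32_t crc32_le_split(uint32_t crc, const unsigned char *p,
			       size_t len)
{
	size_t offset = (uintptr_t)p & OFFSET_MASK;
	size_t head_len, tail_len, words;

	/* Unaligned head: consume bytes up to the next STEP boundary. */
	if (offset && len) {
		head_len = STEP - offset < len ? STEP - offset : len;
		crc = crc32_le_bitwise(crc, p, head_len);
		p += head_len;
		len -= head_len;
	}

	tail_len = len & OFFSET_MASK;
	words = len / STEP;

	/* Aligned body: one full stride per iteration. */
	for (size_t i = 0; i < words; i++) {
		crc = crc32_le_bitwise(crc, p, STEP);
		p += STEP;
	}

	/* Remaining tail bytes, fewer than one stride. */
	if (tail_len)
		crc = crc32_le_bitwise(crc, p, tail_len);

	return crc;
}

/* Self-check against the standard CRC-32 check value for "123456789". */
static void crc32_split_selftest(void)
{
	unsigned long storage[4] = { 0 };	/* gives STEP-aligned bytes */
	unsigned char *buf = (unsigned char *)storage;

	for (int i = 0; i < 9; i++)
		buf[3 + i] = (unsigned char)('1' + i);	/* misaligned start */

	assert((crc32_le_split(~0u, buf + 3, 9) ^ ~0u) == 0xCBF43926u);
	/* len == 0: every branch and the loop are skipped, crc is unchanged. */
	assert(crc32_le_split(0x12345678u, buf + 3, 0) == 0x12345678u);
}
```

With len == 0 the head check, the loop and the tail check all fall through,
so the result is already correct without an early return; the extra branch
would only cost something on the common path.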

> 
> > > +		head_len = min(STEP - offset, len);
> > > +		crc = crc32_le_unaligned(crc, p, head_len, poly, poly_qt);
> > > +		p += head_len;
> > > +		len -= head_len;
> > > +	}
> > > +
> > > +	tail_len = len & OFFSET_MASK;
> > > +	len = len >> STEP_ORDER;
> > > +	p_ul = (unsigned long const *)p;
> > > +
> > > +	for (int i = 0; i < len; i++) {
> > > +		s = crc32_le_prep(crc, p_ul);
> > > +		crc = crc32_le_zbc(s, poly, poly_qt);
> > > +		p_ul++;
> > > +	}
> > > +
> > > +	/* Handle the tail bytes. */
> > > +	p = (unsigned char const *)p_ul;
> > > +	if (tail_len)
> > > +		crc = crc32_le_unaligned(crc, p, tail_len, poly, poly_qt);
> > > +
> > > +	return crc;
> > > +
> > > +legacy:
> > > +	return crc_fb(crc, p, len);
> > > +}
> > > +
> > > +u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
> > > +{
> > > +	return crc32_le_generic(crc, p, len, CRC32_POLY_LE,
> > > +				CRC32_POLY_QT_LE, crc32_le_base);
> > > +}
> > > +
> > > +u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len)
> > > +{
> > > +	return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
> > > +				CRC32C_POLY_QT_LE, __crc32c_le_base);
> > > +}
> > > +
> > > +static inline u32 crc32_be_unaligned(u32 crc, unsigned char const *p,
> > > +				     size_t len)
> > > +{
> > > +	size_t bits = len * 8;
> > > +	unsigned long s = 0;
> > > +	u32 crc_low = 0;
> > > +
> > > +	s = 0;
> > > +	for (int i = 0; i < len; i++)
> > > +		s = *p++ | (s << 8);
> > > +
> > > +	if (__riscv_xlen == 32 || len < sizeof(u32)) {
> > > +		s ^= crc >> (32 - bits);
> > > +		crc_low = crc << bits;
> > > +	} else {
> > > +		s ^= (unsigned long)crc << (bits - 32);
> > > +	}
> > > +
> > > +	crc = crc32_be_zbc(s);
> > > +	crc ^= crc_low;
> > > +
> > > +	return crc;
> > > +}
> > > +
> > > +u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
> > > +{
> > > +	size_t offset, head_len, tail_len;
> > > +	unsigned long const *p_ul;
> > > +	unsigned long s;
> > > +
> > > +	asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > 
> > Same here
> > 
> 
> Will change it.
> 
> Thanks for the comments.
> -Xiao
> 

I am not familiar with this algorithm, but this does seem like it should
show an improvement on hardware with ZBC, so there is no reason to hold
this back from being merged.

When you change the asm goto so it will compile with 6.8, you can add my
tag:

Reviewed-by: Charlie Jenkins <charlie at rivosinc.com>

- Charlie