[PATCH 2/2] bitops: rotate: Add riscv implementation using Zbb extension

David Laight david.laight.linux at gmail.com
Wed Jul 2 03:11:35 PDT 2025


On Tue, 1 Jul 2025 14:32:14 -0400
Yury Norov <yury.norov at gmail.com> wrote:

> On Tue, Jul 01, 2025 at 08:47:01PM +0800, cp0613 at linux.alibaba.com wrote:
> > On Mon, 30 Jun 2025 12:53:03 -0400, yury.norov at gmail.com wrote:
> >   
> > > > > 1. The most trivial compile-case is actually evaluated at compile time; and
> > > > > 2. Any arch-specific code is well explained; and
> > > > > 3. legacy case optimized just as well as non-legacy.  
> > > > 
> > > > 1. As in the above reply, use the generic implementation when compile-time evaluation
> > > >    is possible.
> > > > 2. I will improve the comments later.  
> > > 
> > > I'm particularly interested in ror8/rol8 case:
> > > 
> > >         u32 word32 = ((u32)word << 24) | ((u32)word << 16) | ((u32)word << 8) | word;
> > > 
> > > When you expand it to 32-bit word, and want to rotate, you obviously
> > > need to copy lower quarterword to the higher one:
> > > 
> > >         0xab -> 0xab0000ab
> > > 
> > > That way generic (u8)ror32(0xab, shift) would work. But I don't understand
> > > why you copy the lower 8 bits to inner quarterwords. Is that a hardware
> > > requirement? Can you point to any arch documentation?
> > >   
> > > > 3. As mentioned before, only the 8-bit rotation shows no optimization benefit;
> > > >    16-bit and wider rotations are optimized significantly.
> > > 
> > > I asked you about the asm goto ("legacy") thing: you calculate that
> > > complex word32 _before_ evaluating the goto. So this word32 may get
> > > unused, and you waste cycles. I want to make sure this isn't the case.  
> > 
> > Thank you for your suggestion and reminder. This is not a hardware requirement, but it  
> 
> Sure. Please add
> 
> Suggested-by: Yury Norov (NVIDIA) <yury.norov at gmail.com>
> 
> > is somewhat related, because the Zbb instructions only support word-sized rotates [1].
> > For a shift greater than 8, if the shift is not reduced modulo 8, the 8-bit value
> > has to be replicated across the whole 32-bit word.

I'd not worry about rotates of 8 bits or more (for ror8).
They can be treated as 'undefined behaviour' under the assumption they don't happen.
The 'generic' version needs them to get gcc to generate a 'rorb' on x86.
The negated shift needs masking so that clang doesn't treat the out-of-range
shift as undefined and throw the code away when the shift value is constant.
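
For reference, the masked-negated-shift idiom being described looks roughly
like this (a minimal sketch with a hypothetical name, not necessarily the
kernel's actual generic_ror8):

```
/*
 * Sketch of the masked-rotate idiom: both shift amounts are masked to
 * 0..7, so there is no out-of-range shift for the compiler to treat as
 * undefined, and gcc can still pattern-match this to a 'rorb' on x86.
 */
static inline u8 sketch_ror8(u8 word, unsigned int shift)
{
	return (word >> (shift & 7)) | (word << (-shift & 7));
}
```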

> > 
> > So I considered reducing the shift modulo 8, and found that this needs one
> > instruction fewer, so the final implementation is as follows:
> > ```
> > static inline u8 variable_rol8(u8 word, unsigned int shift)  
> 
> Now that it has assembler inlines, would it help to add the __pure
> qualifier?
> 
> > {
> > 	u32 word32;
> > 
> > 	asm goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > 				      RISCV_ISA_EXT_ZBB, 1)
> > 			  : : : : legacy);
> > 
> > 	word32 = ((u32)word << 24) | word;
> > 	shift = shift % 8;  
> 
>         shift %= 8;
> 
> > 
> > 	asm volatile(
> > 		".option push\n"
> > 		".option arch,+zbb\n"
> > 		"rolw %0, %1, %2\n"
> > 		".option pop\n"
> > 		: "=r" (word32) : "r" (word32), "r" (shift) :);
> > 
> > 	return (u8)word32;
> > 
> > legacy:
> > 	return generic_rol8(word, shift);
> > }
> > 
> > static inline u8 variable_ror8(u8 word, unsigned int shift)
> > {
> > 	u32 word32;
> > 
> > 	asm goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > 				      RISCV_ISA_EXT_ZBB, 1)
> > 			  : : : : legacy);
> > 
> > 	word32 = ((u32)word << 8) | word;
> > 	shift = shift % 8;
> > 
> > 	asm volatile(
> > 		".option push\n"
> > 		".option arch,+zbb\n"
> > 		"rorw %0, %1, %2\n"
> > 		".option pop\n"
> > 		: "=r" (word32) : "r" (word32), "r" (shift) :);
> > 
> > 	return (u8)word32;
> > 
> > legacy:
> > 	return generic_ror8(word, shift);
> > }
> > ```  
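
As a sanity check on the replication constants (an illustrative userspace
sketch, not part of the patch): once the shift is reduced to 0..7, a single
extra copy of the byte is enough, because only those bits can rotate back
into the low byte. rol needs the copy at the top of the word, ror needs it
in bits 8..15:

```
#include <stdint.h>
#include <assert.h>

static uint8_t ref_ror8(uint8_t w, unsigned int s)
{
	return (w >> (s & 7)) | (uint8_t)(w << (-s & 7));
}

static uint8_t ref_rol8(uint8_t w, unsigned int s)
{
	return (uint8_t)(w << (s & 7)) | (w >> (-s & 7));
}

int main(void)
{
	for (unsigned int w = 0; w < 256; w++) {
		for (unsigned int s = 0; s < 8; s++) {
			uint32_t r = ((uint32_t)w << 8) | w;	/* ror expansion */
			uint32_t l = ((uint32_t)w << 24) | w;	/* rol expansion */

			/* what rorw/rolw would leave in the low byte */
			assert((uint8_t)((r >> s) | (r << (-s & 31))) == ref_ror8(w, s));
			assert((uint8_t)((l << s) | (l >> (-s & 31))) == ref_rol8(w, s));
		}
	}
	return 0;
}
```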
> 
> OK, this looks better.
> 
> > I compared the performance of ror8 (Zbb-optimized) and generic_ror8 on the XUANTIE C908
> > by running them in a loop. ror8 is faster, and its advantage becomes more obvious as the
> > number of iterations increases. The test code is as follows:
> > ```
> > 	u8 word = 0x5a;
> > 	u32 shift = 9;
> > 	u32 i, loop = 100;
> > 	u8 ret1, ret2;
> > 
> > 	u64 t1 = ktime_get_ns();
> > 	for (i = 0; i < loop; i++) {
> > 		ret2 = generic_ror8(word, shift);
> > 	}
> > 	u64 t2 = ktime_get_ns();
> > 	for (i = 0; i < loop; i++) {
> > 		ret1 = ror8(word, shift);
> > 	}
> > 	u64 t3 = ktime_get_ns();
> > 
> > 	pr_info("t2-t1=%lld t3-t2=%lld\n", t2 - t1, t3 - t2);
> > ```  
> 
> Please do the following:
> 
> 1. Drop the generic_ror8() and keep only ror/l8()
> 2. Add ror/l16, 32 and 64 tests.
> 3. Adjust the 'loop' so that each subtest will take 1-10 ms on your hw.

That is far too many iterations.
You'll get interrupts dominating the tests.
The best thing is to do 'just enough' iterations to get a meaningful result,
and then repeat a few times and report the fastest (or average excluding
any large outliers).

You also need to ensure the compiler doesn't (or isn't allowed to) pull
the contents of the inlined function outside the loop - and then throw
the loop away.
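
One way to get both properties (a sketch only, assuming OPTIMIZER_HIDE_VAR()
from <linux/compiler.h> to defeat hoisting, plus best-of-N timing):

```
/*
 * Sketch: time 'iters' calls and report the fastest of 'repeats' runs.
 * OPTIMIZER_HIDE_VAR() stops the compiler from hoisting the rotate out
 * of the loop or deleting it entirely.
 * Assumes <linux/ktime.h>, <linux/limits.h> and <linux/minmax.h>.
 */
static u64 bench_ror8(unsigned int iters, unsigned int repeats)
{
	u64 best = U64_MAX;
	u8 word = 0x5a;

	while (repeats--) {
		u64 t = ktime_get_ns();

		for (unsigned int i = 0; i < iters; i++) {
			OPTIMIZER_HIDE_VAR(word);
			word = ror8(word, i & 7);
		}
		t = ktime_get_ns() - t;
		best = min(best, t);
	}
	OPTIMIZER_HIDE_VAR(word);	/* keep the result live */
	return best;
}
```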

The other question is whether any of it is worth the effort.
How many ror8() and ror16() calls are there?
I suspect not many.

Improving the generic ones might be worthwhile.
Perhaps moving the current versions to x86 only.
(I suspect the only other cpu with byte/short rotates is m68k.)

	David



> 4. Name the function something like test_rorl(), and put it next to test_fns()
>    in lib/test_bitops.c. Refer to commit 0a2c6664e56f0 for the implementation.
> 5. Prepend the series with the benchmark patch, just as with the const-eval
>    one, and report before/after numbers for your series.
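
The shape being asked for would presumably look something like this
(a hypothetical sketch modeled on the test_fns() pattern; the name and
iteration count are placeholders):

```
/*
 * Hypothetical sketch of a test_rorl() to sit next to test_fns() in
 * lib/test_bitops.c; the real patch should follow the conventions of
 * commit 0a2c6664e56f0.
 */
static void __init test_rorl(void)
{
	u8 w8 = 0x5a;
	ktime_t time;
	unsigned int i;

	time = ktime_get();
	for (i = 0; i < 10000000; i++) {
		OPTIMIZER_HIDE_VAR(w8);	/* keep the rotate inside the loop */
		w8 = ror8(w8, i & 7);
	}
	time = ktime_get() - time;
	pr_err("ror8:  %18llu ns\n", (unsigned long long)time);

	/* ... repeat for rol8 and the 16/32/64-bit variants ... */
}
```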
> 
> > > Please find attached a test for compile-time ror/rol evaluation.
> > > Please consider prepending your series with it.  
> > 
> > Okay, I'll bring it to the next series.  
> 
> Still waiting for bloat-o-meter results.
> 
> Thanks,
> Yury
> 
> > 
> > [1] https://github.com/riscv/riscv-bitmanip  
> 



