[PATCH 2/2] bitops: rotate: Add riscv implementation using Zbb extension

cp0613 at linux.alibaba.com cp0613 at linux.alibaba.com
Mon Jun 30 05:04:53 PDT 2025


On Sat, 28 Jun 2025 21:48:02 -0400, yury.norov at gmail.com wrote:

> > > > +static inline u8 variable_ror8(u8 word, unsigned int shift)
> > > > +{
> > > > +	u32 word32 = ((u32)word << 24) | ((u32)word << 16) | ((u32)word << 8) | word;
> > > 
> > > Can you add a comment about what is happening here? Are you sure it's
> > > optimized out in case of the 'legacy' alternative?
> > 
> > Thank you for your review. Yes, I referred to the existing variable__fls()
> > implementation, which should be fine.
> 
> No, it's not fine. Because you trimmed your original email completely,
> so there's no way to understand what I'm asking about; and because you
> didn't answer my question. So I'll ask again: what exactly you are doing
> in the line you've trimmed out?

Sorry, I misunderstood your question. I have now restored the trimmed original
email. Here is my answer: the RISC-V Zbb extension only provides 64-bit rotate
instructions rol/ror and 32-bit rotate instructions rolw/rorw. So, to use the
rolw/rorw instructions for 16-bit and 8-bit data, the value is first replicated
to fill a 32-bit word; after the rotate, the result is truncated back to the
original width.
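To illustrate the splicing idea in plain C (a sketch only: the 32-bit rotate is
written out in C instead of using the rorw instruction, and ror8_via_ror32() is
a hypothetical name, not a kernel function):

```c
#include <assert.h>
#include <stdint.h>

/* Generic 8-bit rotate-right, for reference. */
static inline uint8_t generic_ror8(uint8_t word, unsigned int shift)
{
	return (word >> (shift & 7)) | (word << ((-shift) & 7));
}

/*
 * Sketch of the splicing idea: replicate the byte into all four byte
 * lanes of a 32-bit word, rotate the 32-bit word, then truncate.
 * Because the replicated pattern has period 8, rotating the 32-bit
 * word by 'shift' rotates each byte lane by 'shift mod 8'.
 */
static inline uint8_t ror8_via_ror32(uint8_t word, unsigned int shift)
{
	uint32_t w32 = ((uint32_t)word << 24) | ((uint32_t)word << 16) |
		       ((uint32_t)word << 8) | word;

	shift &= 31;	/* stand-in for the Zbb rorw instruction */
	if (shift)
		w32 = (w32 >> shift) | (w32 << (32 - shift));

	return (uint8_t)w32;	/* low byte holds the rotated value */
}
```

The same byte-replication trick extends to 16-bit values, with two copies in a
32-bit word instead of four.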

This data preparation does introduce additional operations. To compare against
the generic implementation, I used the web tool David suggested.

The two functions that need to be compared are briefly summarized as follows:
```
unsigned char generic_ror8(unsigned char word, unsigned int shift)
{
	return (word >> (shift & 7)) | (word << ((-shift) & 7));
}

unsigned char zbb_opt_ror8(unsigned char word, unsigned int shift)
{
	unsigned int word32 = ((unsigned int)word << 24) | \
	    ((unsigned int)word << 16) | ((unsigned int)word << 8) | word;
#ifdef __riscv
	__asm__ volatile("nop"); // ALTERNATIVE(nop)

	__asm__ volatile(
		".option push\n"
		".option arch,+zbb\n"
		"rorw %0, %1, %2\n"
		".option pop\n"
		: "=r" (word32) : "r" (word32), "r" (shift) :);
#endif
	return (unsigned char)word32;
}
```
The disassembly obtained is:
```
generic_ror8:
    andi    a1,a1,7
    negw    a5,a1
    andi    a5,a5,7
    sllw    a5,a0,a5
    srlw    a0,a0,a1
    or      a0,a0,a5
    andi    a0,a0,0xff
    ret

zbb_opt_ror8:
    slli    a5,a0,8
    add     a0,a5,a0
    slliw   a5,a0,16
    addw    a5,a5,a0
    nop
    rorw a5, a5, a1
    andi    a0,a5,0xff
    ret
```
From the total instruction count, zbb_opt_ror8 has one more instruction, but one
of them is a nop, so the difference from generic_ror8 should be very small.
Alternatively, the solution David provided may be better for non-x86.

> > I did consider it, but I did not find any toolchain that provides an
> > implementation similar to __builtin_ror or __builtin_rol. If there is one,
> > please help point it out.
> 
> This is the example of the toolchain you're looking for:
> 
>   /**
>    * rol64 - rotate a 64-bit value left
>    * @word: value to rotate
>    * @shift: bits to roll
>    */
>   static inline __u64 rol64(__u64 word, unsigned int shift)
>   {
>           return (word << (shift & 63)) | (word >> ((-shift) & 63));
>   }
> 
> What I'm asking is: please show me that compile-time rol/ror is still
> calculated at compile time, i.e. ror64(1234, 12) is evaluated at
> compile time.

I see what you mean, I didn't consider the case of constants being evaluated
at compile time, as you pointed out earlier:
"you wire ror/rol() to the variable_ror/rol() unconditionally, and that breaks
compile-time rotation if the parameter is known at compile time."

In the absence of compiler built-in function support, I think it can be handled
like this:
```
#define rol16(word, shift) \
	(__builtin_constant_p(word) && __builtin_constant_p(shift) ? \
	generic_rol16(word, shift) : variable_rol16(word, shift))
```
What do you think?
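A fuller sketch of that dispatch for the 8-bit case (hypothetical names;
variable_ror8() stands in for the asm-accelerated path and is backed by the
generic form here just so the snippet is self-contained; __builtin_constant_p
is a GCC/Clang builtin):

```c
#include <stdint.h>

static inline uint8_t generic_ror8(uint8_t word, unsigned int shift)
{
	return (word >> (shift & 7)) | (word << ((-shift) & 7));
}

/* Stand-in for the Zbb/asm implementation. */
static inline uint8_t variable_ror8(uint8_t word, unsigned int shift)
{
	return generic_ror8(word, shift);
}

/*
 * If both arguments are compile-time constants, pick the plain C form
 * so the optimizer can fold the whole expression into a constant;
 * otherwise fall through to the (potentially asm) variable_ror8().
 */
#define ror8(word, shift) \
	(__builtin_constant_p(word) && __builtin_constant_p(shift) ? \
	 generic_ror8(word, shift) : variable_ror8(word, shift))
```

With this, ror8(0x81, 1) can be folded to 0xC0 by the optimizer, while
non-constant arguments still reach the accelerated path.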

> > In addition, I did not consider it carefully before. If the rotate function
> > is to be genericized, all archneed to include <asm-generic/bitops/rotate.h>.
> > I missed this step.
> 
> Sorry, I'm lost here about what you've considered and what not. I'm OK
> about accelerating ror/rol, but I want to make sure that;
> 
> 1. The most trivial compile-case is actually evaluated at compile time; and
> 2. Any arch-specific code is well explained; and
> 3. legacy case optimized just as well as non-legacy.

1. As in the reply above, the generic implementation is used when compile-time
   evaluation is possible.
2. I will improve the comments later.
3. As mentioned before, only the 8-bit rotate should see no optimization
   benefit; 16-bit and wider rotates will be significantly optimized.

Thanks,
Pei



