[PATCH] riscv: lib: Implement optimized memchr function

Mon Dec 11 07:25:15 PST 2023

On 11/12/2023 15:08, Andrew Jones wrote:
>> As you can see, the new function shows much better results even for
>> the small arrays of 256 elements, therefore I believe it could be a
>> useful addition to the existing riscv-specific string functions.
> 
> Looks good, but do we want to maintain both this version and a zbb
> version? I'd expect a zbb version to be even better.
> 

Hi Andrew,

Yes, ZBB analog would be much better, and if we use ZBB operations we 
could avoid the most part of bit magic happening there.

>> +	add t1, x0, a2
> 
> move t1, a2
> 
> and for the remainder of the function s/x0/zero/
> 

Alright, will be fixed in the next version.
>> +	sltiu t2, a2, MIN_BORDER
>> +	bnez t2, 6f
>> +
>> +	// get the number of bytes we should iterate before alignment
> 
> I'm not sure, but I think even in assembly we prefer the /* */ comment
> format.
> 
>> +	andi t0, a0, SZREG - 1
>> +	beqz t0, 4f
>> +
>> +	# get the SZREG - t0
> 
> I'm 99% sure we don't want to use the # comment syntax.
> 
>> +	xor t0, t0, SZREG - 1
> 
> xori?
> 

Hmm, I'm surprised that it is actually compilable... Yeah, should be fixed
>> +	addi t0, t0, 1
>> +
>> +	sub a2, a2, t0
> 
> nit: Looks a bit odd to put a blank line above the sub line above,
> instead of above the below comment.
> 
>> +	// iterate before alignment
>> +1:
>> +	beq t0, x0, 4f
>> +	lbu t2, 0(a0)
>> +	beq t2, a1, 3f
>> +	addi t0, t0, -1
> 
> This addi t0... isn't necessary if we do
> 

Yeah, sounds reasonable, we can make it faster
> 	add t0, a0, t0
> 1:
> 	beq a0, t0, 4f
> 	...
> 	...
> 	addi a0, a0, 1
> 	j 1b
> 
>> +	addi a0, a0, 1
>> +	j 1b
>> +
>> +2:
>> +	// found a word. Iterate it until we find the target byte
>> +	li t1, SZREG
>> +	j 6f
> 
> These instructions seem oddly placed among the rest.
> 
>> +3:
>> +	ret
> 
> And this is an odd place to put this ret (after unconditional jump and
> in the middle of the function). We can just put a label at the bottom ret.
> 

I agree, thanks!
>> +
>> +4:
>> +	// get the count remainder
>> +	andi t1, a2, SZREG - 1
>> +
>> +	// align the count
>> +	sub a2, a2, t1
>> +
>> +	// if we have no words to iterate, iterate the remainder
>> +	beqz a2, 6f
>> +
>> +	// from 0xBA we will get 0xBABABABABABABABA
>> +	li t3, REP_01
>> +	mul t3, t3, a1
> 
> I don't think we want to implement an optimized assembly function with
> mul. We can just use a few shifts and ors.
> 
> 	slli	t3, a1, 8
> 	or	t3, t3, a1
> 	slli	t4, t3, 16
> 	or	t3, t4, t3
> #if __riscv_xlen == 64
> 	slli	t4, t3, 32
> 	or	t3, t4, t3
> #endif
> 

Nice point, thanks! Will be optimized :)
>> +
>> +	add a2, a2, a0
>> +
>> +	li t4, REP_01
>> +	li t5, REP_80
>> +
>> +5:
>> +	REG_L t2, 0(a0)
>> +
>> +	// after this xor we will get one zero byte in the word if it contains the target byte
>> +	xor t2, t2, t3
>> +
>> +	// word v contains the target byte if (v - 0x01010101) & (~v) & 0x80808080 is positive
> 
> s/is positive/is not zero/
> 
>> +	sub t0, t2, t4
>> +
>> +	not t2, t2
>> +
>> +	and t0, t0, t2
>> +	and t0, t0, t5
>> +
>> +	bnez t0, 2b
>> +	addi a0, a0, SZREG
>> +	bne a0, a2, 5b
>> +
>> +6:
>> +	// iterate the remainder
>> +	beq t1, x0, 7f
>> +	lbu t4, 0(a0)
>> +	beq t4, a1, 3b
>> +	addi a0, a0, 1
>> +	addi t1, t1, -1
> 
> Same comment as above about being able to drop the addi t1...
> 
>> +	j 6b
>> +
>> +7:
>> +	addi a0, x0, 0
> 
> li a0, 0
> 
>> +	ret
>> +SYM_FUNC_END(memchr)
>> +SYM_FUNC_ALIAS(__pi_memchr, memchr)
>> -- 
>> 2.34.1
>>
> 
> Thanks,
> drew
>

Thanks a lot for the review!

--
Kind regards,
Ivan Orlov