[PATCH] riscv: lib: Implement optimized memchr function
Ivan Orlov
ivan.orlov at codethink.co.uk
Mon Dec 11 07:25:15 PST 2023
On 11/12/2023 15:08, Andrew Jones wrote:
>> As you can see, the new function shows much better results even for
>> the small arrays of 256 elements, therefore I believe it could be a
>> useful addition to the existing riscv-specific string functions.
>
> Looks good, but do we want to maintain both this version and a zbb
> version? I'd expect a zbb version to be even better.
>
Hi Andrew,
Yes, ZBB analog would be much better, and if we use ZBB operations we
could avoid the most part of bit magic happening there.
>> + add t1, x0, a2
>
> move t1, a2
>
> and for the remainder of the function s/x0/zero/
>
Alright, will be fixed in the next version.
>> + sltiu t2, a2, MIN_BORDER
>> + bnez t2, 6f
>> +
>> + // get the number of bytes we should iterate before alignment
>
> I'm not sure, but I think even in assembly we prefer the /* */ comment
> format.
>
>> + andi t0, a0, SZREG - 1
>> + beqz t0, 4f
>> +
>> + # get the SZREG - t0
>
> I'm 99% sure we don't want to use the # comment syntax.
>
>> + xor t0, t0, SZREG - 1
>
> xori?
>
Hmm, I'm surprised that it is actually compilable... Yeah, should be fixed
>> + addi t0, t0, 1
>> +
>> + sub a2, a2, t0
>
> nit: Looks a bit odd to put a blank line above the sub line above,
> instead of above the below comment.
>
>> + // iterate before alignment
>> +1:
>> + beq t0, x0, 4f
>> + lbu t2, 0(a0)
>> + beq t2, a1, 3f
>> + addi t0, t0, -1
>
> This addi t0... isn't necessary if we do
>
Yeah, sounds reasonable, we can make it faster
> add t0, a0, t0
> 1:
> beq a0, t0, 4f
> ...
> ...
> addi a0, a0, 1
> j 1b
>
>> + addi a0, a0, 1
>> + j 1b
>> +
>> +2:
>> + // found a word. Iterate it until we find the target byte
>> + li t1, SZREG
>> + j 6f
>
> These instructions seem oddly placed among the rest.
>
>> +3:
>> + ret
>
> And this is an odd place to put this ret (after unconditional jump and
> in the middle of the function). We can just put a label at the bottom ret.
>
I agree, thanks!
>> +
>> +4:
>> + // get the count remainder
>> + andi t1, a2, SZREG - 1
>> +
>> + // align the count
>> + sub a2, a2, t1
>> +
>> + // if we have no words to iterate, iterate the remainder
>> + beqz a2, 6f
>> +
>> + // from 0xBA we will get 0xBABABABABABABABA
>> + li t3, REP_01
>> + mul t3, t3, a1
>
> I don't think we want to implement an optimized assembly function with
> mul. We can just use a few shifts and ors.
>
> slli t3, a1, 8
> or t3, t3, a1
> slli t4, t3, 16
> or t3, t4, t3
> #if __riscv_xlen == 64
> slli t4, t3, 32
> or t3, t4, t3
> #endif
>
Nice point, thanks! Will be optimized :)
>> +
>> + add a2, a2, a0
>> +
>> + li t4, REP_01
>> + li t5, REP_80
>> +
>> +5:
>> + REG_L t2, 0(a0)
>> +
>> + // after this xor we will get one zero byte in the word if it contains the target byte
>> + xor t2, t2, t3
>> +
>> + // word v contains the target byte if (v - 0x01010101) & (~v) & 0x80808080 is positive
>
> s/is positive/is not zero/
>
>> + sub t0, t2, t4
>> +
>> + not t2, t2
>> +
>> + and t0, t0, t2
>> + and t0, t0, t5
>> +
>> + bnez t0, 2b
>> + addi a0, a0, SZREG
>> + bne a0, a2, 5b
>> +
>> +6:
>> + // iterate the remainder
>> + beq t1, x0, 7f
>> + lbu t4, 0(a0)
>> + beq t4, a1, 3b
>> + addi a0, a0, 1
>> + addi t1, t1, -1
>
> Same comment as above about being able to drop the addi t1...
>
>> + j 6b
>> +
>> +7:
>> + addi a0, x0, 0
>
> li a0, 0
>
>> + ret
>> +SYM_FUNC_END(memchr)
>> +SYM_FUNC_ALIAS(__pi_memchr, memchr)
>> --
>> 2.34.1
>>
>
> Thanks,
> drew
>
Thanks a lot for the review!
--
Kind regards,
Ivan Orlov
More information about the linux-riscv
mailing list