[PATCH] riscv: lib: Optimize 'strlen' function

Sun Dec 17 15:23:19 PST 2023

On 12/17/23 18:10, David Laight wrote:
> From: Ivan Orlov
>> Sent: 13 December 2023 15:46
> 
> Looking at the old code...
> 
>>   1:
>> -	lbu	t0, 0(t1)
>> -	beqz	t0, 2f
>> -	addi	t1, t1, 1
>> -	j	1b
> 
> I suspect there is (at least) a two clock stall between
> the 'ldu' and 'beqz'.

Hmm, is the stall caused by the latency of the memory access? Why do
two back-to-back loads (as in the example you provided) do the trick?
Is it because the two "ldb"s can be issued in parallel?

> Allowing for one clock for the 'predicted taken' branch
> that is 7 clocks/byte.
> 
> Try this one - especially on 32bit:
> 
> 	mov	t0, a0
> 	and	t1, t0, 1
> 	sub	t0, t0, t1
> 	bnez	t1, 2f
> 1:
> 	ldb	t1, 0(t0)
> 2:	ldb	t2, 1(t0)
> 	add	t0, t0, 2
> 	beqz	t1, 3f
> 	bnez	t2, 1b
> 	add	t0, t0, 1
> 3:	sub	t0, t0, 2
> 	sub	a0, t0, a0
> 	ret
>
I tested it on my 64bit board, and this variant is definitely faster
than the original implementation! Here are the results of the benchmark
which compares this variant with the word-oriented one:

Test count per size: 1000

Size: 1 (+-0), mean_old: 711, mean_new: 708
Size: 2 (+-0), mean_old: 649, mean_new: 713
Size: 4 (+-0), mean_old: 499, mean_new: 506
Size: 8 (+-0), mean_old: 344, mean_new: 350
Size: 16 (+-0), mean_old: 342, mean_new: 362
Size: 32 (+-0), mean_old: 369, mean_new: 387
Size: 64 (+-0), mean_old: 393, mean_new: 401
Size: 128 (+-4), mean_old: 457, mean_new: 424
Size: 256 (+-13), mean_old: 578, mean_new: 476
Size: 512 (+-31), mean_old: 842, mean_new: 573
Size: 1024 (+-19), mean_old: 1305, mean_new: 777
Size: 2048 (+-97), mean_old: 2280, mean_new: 1193
Size: 4096 (+-149), mean_old: 4226, mean_new: 2002
Size: 8192 (+-439), mean_old: 8131, mean_new: 3634
Size: 16384 (+-615), mean_old: 16353, mean_new: 6905
Size: 32768 (+-2566), mean_old: 37075, mean_new: 14232
Size: 65536 (+-6047), mean_old: 73797, mean_new: 37090
Size: 131072 (+-10071), mean_old: 146802, mean_new: 73402
Size: 262144 (+-18150), mean_old: 293003, mean_new: 146118
Size: 524288 (+-21247), mean_old: 585057, mean_new: 291324

Benchmark code:

https://github.com/ivanorlov2206/strlen-benchmark/blob/main/strlentest.c

It looks like the variant you suggested could be faster for shorter 
strings even on the 64bit platform.
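
To make sure I read the loop correctly, here is a rough C sketch of the
same idea. The name strlen_by2 is made up for illustration, and the
control flow is slightly simplified (the assembly enters the loop in the
middle for an odd start address, while the sketch handles that byte up
front). Note that it may read one byte past the terminator inside an
aligned pair; the assembly can get away with that because an aligned
pair never crosses a page, but plain C does not strictly allow it:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scans two bytes per iteration, like the asm above. */
static size_t strlen_by2(const char *s)
{
	const unsigned char *start = (const unsigned char *)s;
	const unsigned char *p = start;
	unsigned char b0, b1;

	/* Odd start address: check the first byte alone so p becomes even. */
	if ((uintptr_t)p & 1) {
		if (!*p)
			return 0;
		p++;
	}

	for (;;) {
		/* Two independent loads from one 2-byte aligned pair. */
		b0 = p[0];
		b1 = p[1];	/* may lie one byte past the terminator */
		if (!b0)
			return (size_t)(p - start);
		if (!b1)
			return (size_t)(p + 1 - start);
		p += 2;
	}
}
```

Whether the compiler keeps both loads ahead of the first branch is up to
the compiler, so this only shows the logic, not the timing.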

Maybe we could enhance it even more by loading 4 consecutive bytes into
different registers so the memory loads would still be parallelized?
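
Something along these lines, written in C only to show the logic. This
is an untested idea rather than a proposed patch: the name strlen_by4
and the byte-wise prologue are my assumptions, and like any variant
that loads ahead it may read up to three bytes past the terminator
within one 4-byte aligned group:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: four independent byte loads per iteration. */
static size_t strlen_by4(const char *s)
{
	const unsigned char *start = (const unsigned char *)s;
	const unsigned char *p = start;
	unsigned char b0, b1, b2, b3;

	/* Byte-wise prologue until p is 4-byte aligned. */
	while ((uintptr_t)p & 3) {
		if (!*p)
			return (size_t)(p - start);
		p++;
	}

	for (;;) {
		/* All four loads stay inside one aligned 4-byte group. */
		b0 = p[0];
		b1 = p[1];
		b2 = p[2];
		b3 = p[3];
		if (!b0)
			return (size_t)(p - start);
		if (!b1)
			return (size_t)(p + 1 - start);
		if (!b2)
			return (size_t)(p + 2 - start);
		if (!b3)
			return (size_t)(p + 3 - start);
		p += 4;
	}
}
```

The extra compares per iteration might eat the gain for short strings,
so it would need to be benchmarked the same way as above.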

-- 
Kind regards,
Ivan Orlov