[PATCH] riscv: lib: optimize strlen loop efficiency

Thu Jan 29 00:34:33 PST 2026

On 2026/1/29 02:59, David Laight wrote:
> On Thu, 15 Jan 2026 18:46:19 +0000
> David Laight <david.laight.linux at gmail.com> wrote:
> 
> ... 
>> While I suspect the per-byte cost is 'two bytes/clock' on x86-64
>> the fixed cost may move the break-even point above the length of the
>> average strlen() in the kernel.
>> Of course, x86 probably falls back to 'rep scasb' at (maybe)
>> (40 + 2n) clocks for 'n' bytes.
>> A carefully written slightly unrolled asm loop might manage one
>> byte per clock!
>> I could spend weeks benchmarking different versions.
> 
> I've spent a quick half-hour...
> 
> On my zen-5 in userspace:
> 
> glibc's strlen() is showing the same fixed cost (50 clocks including overhead)
> for sizes below (about) 100 bytes, for big buffers add 1 clock for ~50 bytes.
> It must be using some simd instructions.
> 
> A simple:
> 	len = 0; while (s[len]) len++; return len;
> loop is about 1 byte/clock, overhead ~25 clocks (probably mostly the one 'rdpmc'
> instruction).
> (Needs a barrier() to stop gcc converting it to a libc call.)
> 
> Unrolling the loop once:
> 	for (len = 0; s[len]; len += 2)
> 		if (!s[len + 1]) return len + 1;
> 	return len;
> actually runs twice as fast - so 2 bytes/clock.
> 
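
For reference, a minimal userspace sketch of the two byte loops quoted above.
It assumes GCC or Clang; the compiler_barrier() macro and the function names
are hypothetical stand-ins, and this is not necessarily the exact code that
was benchmarked.

```c
#include <stddef.h>

/*
 * Assumption: an empty asm statement with a "memory" clobber, similar in
 * spirit to the kernel's barrier(). The macro name is made up for this sketch.
 */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* One byte per iteration. */
static size_t strlen_simple(const char *s)
{
	size_t len = 0;

	while (s[len]) {
		/* Stops the compiler from turning the loop into a strlen() call. */
		compiler_barrier();
		len++;
	}
	return len;
}

/* Two bytes per iteration, following the quoted unrolled loop. */
static size_t strlen_unroll2(const char *s)
{
	size_t len;

	for (len = 0; s[len]; len += 2) {
		compiler_barrier();
		/* s[len + 1] is only read once s[len] is known to be non-zero. */
		if (!s[len + 1])
			return len + 1;
	}
	return len;
}
```

Neither loop reads past the terminating NUL, so both are safe on any
NUL-terminated buffer regardless of alignment.
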
> Unrolling 4 times doesn't help, suddenly goes somewhat slower somewhere
> between 128 and 256 bytes (to 1.5 bytes/clock).
> 
> The C 'longs' loop has an overhead of ~45 clocks and does 6 bytes/clock.
> So that is better for buffers longer than 64 bytes.
> 
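
As an illustration of what a 'longs' loop can look like, here is one common
word-at-a-time shape in plain C. It is a sketch under stated assumptions, not
necessarily the version measured above: the names are hypothetical, and the
aligned word load may read a few bytes past the terminator, which cannot fault
on typical hardware but is outside what ISO C strictly guarantees.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES	((unsigned long)-1 / 0xff)	/* 0x0101...01 */
#define HIGHS	(ONES * 0x80)			/* 0x8080...80 */
/* Non-zero if at least one byte of x is zero. */
#define HAS_ZERO(x)	(((x) - ONES) & ~(x) & HIGHS)

static size_t strlen_words(const char *s)
{
	const char *p = s;
	unsigned long w;

	/* Byte loop until p is word aligned, so a word load never crosses a page. */
	while ((uintptr_t)p % sizeof(w)) {
		if (!*p)
			return (size_t)(p - s);
		p++;
	}

	/* Word loop: assumption is that reading the whole aligned word is safe. */
	for (;;) {
		memcpy(&w, p, sizeof(w));
		if (HAS_ZERO(w))
			break;
		p += sizeof(w);
	}

	/* The terminator is somewhere in this word; find it bytewise. */
	while (*p)
		p++;
	return (size_t)(p - s);
}
```

The alignment prologue and the bytewise tail are the kind of fixed overhead
that only pays off once the string is long enough for the word loop to
dominate.
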
> The 'elephant in the room' is 'repne scasb'.
> The fixed cost is some 150 clocks and the per-byte cost is 3 clocks.
> 
> I don't think any of the Intel CPUs I have will do a 'one clock loop'.
> I certainly failed to get one in the past when there was a data-dependency
> between the iterations.
> 
> But I don't have anything modern (newest is an i7-7xxx) and I don't have
> any old amd ones.
> I need to get a zen-1 (or 1a) and one of the Intel systems that should be
> cheap because they won't run win-11.

Thank you very much for sharing these detailed test results and your in-depth
analysis. It is truly helpful and inspiring to see how different loop strategies
perform on real hardware.

My current priority is the RISC-V patch series. Once this is done, I'd love to
follow up and explore potential improvements for the generic C implementation.

While I don't have many x86 machines at hand either, I do have access to some
ARM64 and LoongArch hardware. I think I can also perform tests and observations
on these platforms later.

Thanks again for the great discussion!

-- 
With Best Regards,
Feng Jiang