[PATCH] riscv: lib: optimize strlen loop efficiency

Wed Jan 28 10:59:04 PST 2026

On Thu, 15 Jan 2026 18:46:19 +0000
David Laight <david.laight.linux at gmail.com> wrote:

... 
> While I suspect the per-byte cost is 'two bytes/clock' on x86-64
> the fixed cost may move the break-even point above the length of the
> average strlen() in the kernel.
> Of course, x86 probably falls back to 'rep scasb' at (maybe)
> (40 + 2n) clocks for 'n' bytes.
> A carefully written slightly unrolled asm loop might manage one
> byte per clock!
> I could spend weeks benchmarking different versions.

I've spent a quick half-hour...

On my zen-5 in userspace:

glibc's strlen() is showing the same fixed cost (50 clocks including overhead)
for sizes below (about) 100 bytes; for big buffers add 1 clock for ~50 bytes.
It must be using some simd instructions.

A simple:
	len = 0; while (s[len]) len++; return len;
loop is about 1 byte/clock, overhead ~25 clocks (probably mostly the one 'rdpmc'
instruction).
(Needs a barrier() to stop gcc converting it to a libc call.)
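
A minimal sketch of that loop with the barrier in place. The name
simple_strlen() and the userspace barrier() definition (an empty asm with a
"memory" clobber, in the style of the kernel macro) are assumptions for
illustration, not the exact test code:

```c
#include <stddef.h>

/* Compiler-only barrier; assumed equivalent to the kernel's barrier(). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* One byte per iteration. 'simple_strlen' is a hypothetical name. */
static size_t simple_strlen(const char *s)
{
	size_t len = 0;

	while (s[len]) {
		/* Keeps gcc from replacing the whole loop with a strlen() call. */
		barrier();
		len++;
	}
	return len;
}
```

The barrier emits no instructions, so it should not change the per-byte cost
being measured.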

Unrolling the loop once:
	for (len = 0; s[len]; len += 2)
		if (!s[len + 1]) return len + 1;
	return len;
actually runs twice as fast - so 2 bytes/clock.
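
The same unrolled loop as a complete function, as a sketch. The name
strlen_unroll2() is hypothetical, and whether gcc leaves this form alone
without a barrier() is an assumption that would need checking in the
generated code:

```c
#include <stddef.h>

/* Two bytes tested per iteration. 'strlen_unroll2' is a hypothetical name. */
static size_t strlen_unroll2(const char *s)
{
	size_t len;

	for (len = 0; s[len]; len += 2) {
		/* s[len] is non-zero here, so reading s[len + 1] stays in bounds. */
		if (!s[len + 1])
			return len + 1;
	}
	return len;
}
```

The second load is only issued after the first byte is known to be non-zero,
so the function never reads past the terminator.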

Unrolling 4 times doesn't help; it suddenly goes somewhat slower somewhere
between 128 and 256 bytes (to 1.5 bytes/clock).

The C 'longs' loop has an overhead of ~45 clocks and does 6 bytes/clock.
So it is better for buffers longer than 64 bytes.
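
For reference, a sketch of the general word-at-a-time ('longs') technique.
This is not the code from the patch; the names strlen_longs(), word_t, ONES
and HIGHS are made up here, and it assumes gcc/clang (for may_alias) and that
reading the rest of an aligned word past the terminator is harmless, which
holds for page-granular protection but is not strictly valid C:

```c
#include <stddef.h>
#include <stdint.h>

/* may_alias lets us read char data through a long-sized type (gcc/clang). */
typedef unsigned long __attribute__((may_alias)) word_t;

#define ONES	(~0UL / 0xff)	/* 0x0101...01 */
#define HIGHS	(ONES << 7)	/* 0x8080...80 */

/* 'strlen_longs' is a hypothetical name for this sketch. */
static size_t strlen_longs(const char *s)
{
	const char *p = s;
	const word_t *w;

	/* Byte loop until p is aligned to sizeof(long). */
	while ((uintptr_t)p % sizeof(long)) {
		if (!*p)
			return (size_t)(p - s);
		p++;
	}

	/* (v - ONES) & ~v & HIGHS is non-zero iff some byte of v is zero. */
	w = (const word_t *)p;
	while (!((*w - ONES) & ~*w & HIGHS))
		w++;

	/* Locate the zero byte inside the final word. */
	p = (const char *)w;
	while (*p)
		p++;
	return (size_t)(p - s);
}
```

The alignment prologue and the final byte scan are where the extra fixed cost
comes from; the main loop tests sizeof(long) bytes per iteration.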

The 'elephant in the room' is 'repne scasb'.
The fixed cost is some 150 clocks and the cost is 3 clocks/byte.
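
Plugging the figures above into a 'fixed + n * per-byte' model gives the
break-even points. A sketch only: the struct and function names are
hypothetical, and the 25 clock fixed cost for the unrolled loop is assumed to
match the simple loop (it was not measured separately above):

```c
#include <stddef.h>

/* Estimated clocks = fixed + n * num / den. */
struct strlen_cost {
	unsigned long fixed;
	unsigned long num, den;
};

static const struct strlen_cost cost_simple  = {  25, 1, 1 };	/* 1 byte/clock */
static const struct strlen_cost cost_unroll2 = {  25, 1, 2 };	/* fixed cost assumed */
static const struct strlen_cost cost_longs   = {  45, 1, 6 };	/* 6 bytes/clock */
static const struct strlen_cost cost_scasb   = { 150, 3, 1 };	/* 3 clocks/byte */

/* Is a strictly cheaper than b for n bytes? Cross-multiplied to stay in integers. */
static int cheaper(const struct strlen_cost *a, const struct strlen_cost *b,
		   size_t n)
{
	return (a->fixed * a->den + n * a->num) * b->den <
	       (b->fixed * b->den + n * b->num) * a->den;
}

/* Smallest n below limit where a beats b, or limit if it never does. */
static size_t first_win(const struct strlen_cost *a,
			const struct strlen_cost *b, size_t limit)
{
	size_t n;

	for (n = 0; n < limit; n++)
		if (cheaper(a, b, n))
			return n;
	return limit;
}
```

With these numbers first_win(&cost_longs, &cost_unroll2, 4096) is 61, in line
with the ~64 byte estimate, while 'repne scasb' never beats even the simple
byte loop.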

I don't think any of the Intel cpus I have will do a 'one clock loop'.
I certainly failed to get one in the past when there was a data-dependency
between the iterations.

But I don't have anything modern (newest is an i7-7xxx) and I don't have
any old amd ones.
I need to get a zen-1 (or 1a) and one of the Intel systems that should be
cheap because they won't run win-11.

	David