[PATCH] riscv: lib: optimize strlen loop efficiency
David Laight
david.laight.linux at gmail.com
Thu Jan 15 10:46:19 PST 2026
On Thu, 15 Jan 2026 11:19:47 +0000
David Laight <david.laight.linux at gmail.com> wrote:
> For 64bit you can do a lot better (in C) by loading 64bit words and doing
> the correct 'shift and mask' sequence to detect a zero byte.
> It usually isn't worth it for 32bit.
>
> Does need to handle a mis-aligned base - eg by masking the bits off
> the base pointer and or'ing in non-zero values to the value read from
> the base pointer.
>
> David
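
For illustration only (not part of the original mail): a minimal sketch of the
'or in non-zero values' step described in the quote, assuming a 64-bit
little-endian word. The helper name align_fill_mask() is hypothetical.

```c
#include <stdint.h>

/*
 * Little-endian sketch: return a mask with 0xff in each of the 'off'
 * lowest bytes (0 <= off <= 7).  Or'ing it into the first aligned word
 * stops the bytes that precede the string from being seen as a NUL.
 */
static uint64_t align_fill_mask(unsigned int off)
{
	/* Two shifts keep each shift count below 64 (a 64-bit shift is undefined). */
	return (~(uint64_t)0 >> 8) >> (8 * (7 - off));
}

/* First word of the scan: the bytes below 'off' are forced non-zero. */
static uint64_t first_word(uint64_t aligned_word, unsigned int off)
{
	return aligned_word | align_fill_mask(off);
}
```

With off == 0 the mask is zero, so an already aligned string goes through
the same path unchanged.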
The version below seems to work https://www.godbolt.org/z/sME3Ts6vW
It actually looks ok for x86-32, the loop is 8 instructions plus the branch
but the 'register dependency chain' is only 4 instructions.
So maybe better than byte compares for moderate to long strings.
(Especially if the cpu starts speculatively executing the next loop
iteration.)
The OPTIMIZER_HIDE_VAR() helps a lot on (eg) MIPS-64 and a bit elsewhere
since most 64bit cpus can't load 64bit immediates.
I can't get gcc and clang to reliably have a loop with a conditional
jump at the bottom, especially with an unconditional jump into the
loop (to remove the '| mask' from the loop body).
Also KASAN (or one of its friends) won't like the code reading entire
words that hold the string.
And it does need ffs/clz instructions - or a different loop bottom.
(For BE one with clzl() returning 0 will work.)
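
As a rough sketch of what a 'different loop bottom' could look like without
ffs/clz (an assumption, not code from this mail): once the word loop has
found a word containing a zero byte, shift through that word a byte at a
time. Both helper names are hypothetical and a 64-bit word is assumed.

```c
#include <stdint.h>

/* Little-endian: index of the first zero byte, counting from the low end. */
static unsigned int first_zero_byte_le(uint64_t val)
{
	unsigned int n = 0;

	while (val & 0xff) {
		val >>= 8;
		n++;
	}
	return n;
}

/* Big-endian: index of the first zero byte, counting from the high end. */
static unsigned int first_zero_byte_be(uint64_t val)
{
	unsigned int n = 0;

	while (val >> 56) {
		val <<= 8;
		n++;
	}
	return n;
}
```

Both loops terminate after at most 8 shifts because the shifts bring in
zero bytes. They trade the single ffs/clz for up to 7 extra iterations on
the final word only.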
While I suspect the per-byte cost is 'two bytes/clock' on x86-64
the fixed cost may move the break-even point above the length of the
average strlen() in the kernel.
Of course, x86 probably falls back to 'rep scasb' at (maybe)
(40 + 2n) clocks for 'n' bytes.
A carefully written slightly unrolled asm loop might manage one
byte per clock!
I could spend weeks benchmarking different versions.
David

#define OPTIMIZER_HIDE_VAR(var) \
	__asm__ ("" : "=r" (var) : "0" (var))

/* Set BE to test big-endian on little-endian.
 * For real BE either do a byteswapping read or use the BE code. */
#ifdef BE
#define SWP(x) __builtin_bswap64(x)
#define SHIFT <<
#else
#define SWP(x) (x)
#define SHIFT >>
#endif

unsigned long my_strlen(const char *s)
{
	unsigned int off = (unsigned long)s % sizeof (long);
	const unsigned long *p = (void *)(s - off);
	unsigned long val;
	unsigned long mask;
	unsigned long ones = 0x01010101;

	/* Force the compiler to generate the related constants sanely. */
	OPTIMIZER_HIDE_VAR(ones);
	ones |= ones << 16 << 16;

	mask = ((~0ul SHIFT 8) SHIFT 8 * (sizeof (long) - 1 - off));
	do {
		val = SWP(*p++) | mask;
		mask = (val - ones) & ~val & ones << 7;
	} while (!mask);

#ifdef BE
	off = __builtin_clzl(mask);
	/* Correct for "...\x01" */
	val <<= off;
	for (off /= 8; val > (~0ul >> 8); off++)
		val <<= 8;
#else
	off = (__builtin_ffsl(mask) - 1)/8;
#endif

	return (const char *)(p - 1) + off - s;
}
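
For readers unfamiliar with the test in the loop, here is a small
stand-alone sketch (not part of the original mail, 64-bit words assumed,
helper names hypothetical) that compares the '(val - ones) & ~val & ones << 7'
expression against a plain per-byte check. It shows why the BE path needs
the "...\x01" correction: a 0x01 byte directly above a real zero byte gets
flagged as well.

```c
#include <stdint.h>

#define ONES  0x0101010101010101ull
#define HIGHS (ONES << 7)

/* Same expression as the loop body: bit 7 set in every byte that looks zero. */
static uint64_t zero_byte_flags(uint64_t val)
{
	return (val - ONES) & ~val & HIGHS;
}

/* Reference per-byte check: bit 7 set only in bytes that really are zero. */
static uint64_t true_zero_flags(uint64_t val)
{
	uint64_t flags = 0;
	unsigned int i;

	for (i = 0; i < 8; i++)
		if (((val >> (8 * i)) & 0xff) == 0)
			flags |= (uint64_t)0x80 << (8 * i);
	return flags;
}
```

For 0x4141414141010041 the fast expression flags bytes 1 and 2 although
only byte 1 is zero, because the borrow from the zero byte propagates into
the 0x01 above it. The lowest flag is always genuine, so the LE ffs result
needs no fix, while the highest flag (what clz finds) may not be.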