[PATCH] riscv: use the generic string routines

David Laight David.Laight at ACULAB.COM
Sat Sep 11 10:26:12 PDT 2021


..
> These ended up getting rejected by Linus, so I'm going to hold off on
> this for now.  If they're really out of lib/ then I'll take the C
> routines in arch/riscv, but either way it's an issue for the next
> release.

I've been half following this.
I've not seen any comparisons between the C functions proposed
here and the riscv asm ones that had the fix for misaligned
transfers applied.

IIRC there is a comment in the asm ones that the unrolled
'read lots' - 'write lots' loop is faster than the older
(asm) read-write loop.

But I've not seen any architectural discussions at all.

A simple in-order single-issue cpu will execute the
unrolled loop faster just because it has fewer instructions.
The read-lots - write-lots scheme almost certainly helps
avoid read latency delaying things, provided multiple
reads can be pipelined.
The writes are almost certainly 'posted' and pipelined,
but a simple cpu could easily require all writes to
finish before doing a read.
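
To make the comparison concrete, here is a rough C sketch of the
two loop shapes (purely illustrative - these are not the kernel
routines, the function names are made up, and alignment and tail
handling are ignored):

#include <stddef.h>

/* Simple read-write loop: each store waits behind the load before it. */
void copy_simple(long *dst, const long *src, size_t words)
{
	while (words--)
		*dst++ = *src++;
}

/*
 * "Read lots, write lots": the loads can overlap and hide read
 * latency even on an in-order core, and the loop-control overhead
 * is amortised over four words.
 */
void copy_batched(long *dst, const long *src, size_t words)
{
	for (; words >= 4; words -= 4, src += 4, dst += 4) {
		long a = src[0], b = src[1], c = src[2], d = src[3];

		dst[0] = a;
		dst[1] = b;
		dst[2] = c;
		dst[3] = d;
	}
	while (words--)		/* trailing words */
		*dst++ = *src++;
}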

A super-scalar (multi-issue) cpu gives you the ability
to get the loop control instructions 'for free' with
carefully written assembler.
At which point a copy of 'hot cache' data should be
limited only by the cpu's cache memory bandwidth.

If reads and writes can interleave then a loop that
alternates reads and writes (read each register
just after writing it) may mean that you always
keep the cpu-cache interface busy.
This would be especially true if the cpu can execute
both a cache read and write in the same cycle.
(Which many moderate-performance cpus can.)
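
A C sketch of that alternation (again illustrative only - the
compiler is free to reschedule it, which is exactly why the real
thing tends to be written in asm so the load/store pairing is
pinned down):

#include <stddef.h>

/*
 * Each register is re-loaded with the next word immediately after
 * its current value has been stored, so a core that can issue one
 * load and one store per cycle keeps both paths to the cache busy.
 */
void copy_interleaved(long *dst, const long *src, size_t words)
{
	long a, b;

	if (words < 2) {
		while (words--)
			*dst++ = *src++;
		return;
	}

	/* Prime the pipeline with the first two words. */
	a = src[0];
	b = src[1];
	src += 2;
	words -= 2;

	for (; words >= 2; words -= 2, src += 2, dst += 2) {
		dst[0] = a;	/* write a ... */
		a = src[0];	/* ... and immediately refill it */
		dst[1] = b;
		b = src[1];
	}

	/* Drain the last two primed words. */
	dst[0] = a;
	dst[1] = b;
	dst += 2;
	while (words--)		/* at most one word remains */
		*dst++ = *src++;
}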

None of this requires out-of-order execution, just the
ability for execution to continue while a read is in progress.

I'm also guessing that any performance testing has been
done with the (relatively) cheap boards that are readily
available.

But I've also seen references in the press to much faster
riscv cpu that are definitely multi-issue and may have
some simple out-of-order execution.
Any changes ought to be tested on these faster systems.

I also recall that some of the performance measurements
were made with long buffers - those will be dominated by
cache-to-DRAM (and maybe TLB lookup) timings, not the
copy loop itself.

For a simple cpu you ought to be able to measure the
number of cpu cycles used for a copy - and account for
all of them.
For something like x86 you can show that the copy is
being limited by the cpu-cache bandwidth.
(FWIW measurements of the inet checksum code on x86
show it runs at half the expected speed on a lot of
Intel cpu - no one ever measured it.)
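
Something like the following userspace sketch will do that,
assuming a riscv target where the cycle CSR is readable from user
mode (often it isn't, in which case perf counters or a kernel-side
get_cycles() are needed instead). A buffer that fits in L1 keeps
the DRAM and TLB costs mentioned above out of the number:

#include <stdio.h>
#include <string.h>

static inline unsigned long rdcycle(void)
{
	unsigned long c;

	/* rdcycle is the standard pseudo-instruction for the cycle CSR. */
	asm volatile ("rdcycle %0" : "=r" (c));
	return c;
}

int main(void)
{
	static char src[4096], dst[4096];	/* small enough to stay in L1 */
	unsigned long best = ~0ul;

	/* Repeat so the caches are warm; report the best (quietest) run. */
	for (int i = 0; i < 100; i++) {
		unsigned long t0 = rdcycle();

		memcpy(dst, src, sizeof(src));
		if (rdcycle() - t0 < best)
			best = rdcycle() - t0;
	}
	printf("%lu cycles for %zu bytes (%.2f bytes/cycle)\n",
	       best, sizeof(src), (double)sizeof(src) / best);
	return 0;
}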

	David


