> BTW off topic (but relevant to this patchset), I strongly feel that
> routines like memset/memcpy are better coded in assembly for really
> water tight instruction scheduling and ease of further optimizing (e.g.
> use of CMO.zero etc as experimented by Philipp). What is blocking you
> from optimizing the asm version ? You are leaving the fate of these
> critical routines in the hand of compiler - this can lead to performance
> shenanigans on a big gcc upgrade.

You also need to worry about the cost of short transfers.
A few cycles there could make a much bigger difference
than something that speeds up long transfers.
Short ones are likely to be fairly common.
I doubt the loop unrolling optimisation in gcc is actually
any good for loops that might only run a few times.
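
To make the short-transfer point concrete, here is a minimal sketch of the
shape I have in mind. It is not the kernel's memcpy() and not the code from
this patchset: the name sketch_memcpy() and the SHORT_LIMIT threshold of 16
are made up for illustration, and the right cut-off would have to be measured
on real hardware. The only point is that a short copy is tested for first and
never reaches the unrolled loop.

```c
#include <stddef.h>

/* Hypothetical threshold, chosen arbitrarily for this sketch. */
#define SHORT_LIMIT 16

/* Hypothetical helper, not the kernel's memcpy(). No overlap support. */
void *sketch_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Short copies: one compare, then a plain byte loop and out. */
	if (len < SHORT_LIMIT) {
		while (len--)
			*d++ = *s++;
		return dst;
	}

	/* Long copies: 8 bytes per iteration, unrolled by hand. */
	while (len >= 8) {
		d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
		d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
		d += 8;
		s += 8;
		len -= 8;
	}

	/* Whatever is left over after the unrolled loop (0 to 7 bytes). */
	while (len--)
		*d++ = *s++;

	return dst;
}
```

A real version would copy words rather than bytes in the long path, but the
early exit for short lengths is the part that matters here.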

Fortunately the kernel doesn't get 'hit by' gcc unrolling
loops into AVX instructions.
The setup costs for that (and I-cache footprint) are horrid.
Although I suspect it is that optimisation that 'broke'
code that used misaligned pointers on overlapping data.

It is a general problem with the 'one size fits all' memcpy().


