Call for testing/opinions: Optimized memset/memcpy

Sat Jul 13 17:51:18 EDT 2013

Willy Tarreau <w <at> 1wt.eu> writes:

> OK I've run bench.script on the following platforms :

Thanks, that's incredibly helpful!

Note that Thumb2 mode usually makes little difference in synthetic
benchmarks, because the benchmark code fits into the L1 instruction cache;
the benefit of Thumb2 shows up in real-world usage, when the active code
footprint becomes larger.

To summarize, memset seems to be in good shape, and the "fast path" for
common word-aligned memcpy of size <= 256 bytes also seems to be working well.

However, the copy_page and memcpy results for larger sizes suggest that
the prefetch strategy isn't working well on these platforms. Note also
that on the quad-core platform the existing copy_page is highly sub-optimal
as well.

Fixing the preload strategy for these platforms may simply be a matter of
changing the configurable constant PREFETCH_DISTANCE from 3 to 2 (from an
offset of 192 bytes to 128 bytes), which more closely mimics the original
kernel memcpy. I have added PREFETCH_DISTANCE as a configurable parameter
in the Makefile in the latest version of test-arm-kernel-memcpy. It will
be interesting to see the results of testing with a PREFETCH_DISTANCE
of 2, especially on the quad-core platform or a similar one.
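
To make the distance/offset relationship concrete, here is a minimal C
sketch. It is not the assembly routine from test-arm-kernel-memcpy. The
64-byte unit is an assumption inferred from the figures above (3 -> 192
bytes, 2 -> 128 bytes), and prefetch_offset() and copy_with_prefetch() are
hypothetical names used only for illustration. __builtin_prefetch is the
GCC/Clang builtin.

```c
#include <stddef.h>

/* Assumption: one prefetch unit is 64 bytes, inferred from the figures
 * in this post (distance 3 -> 192 bytes, distance 2 -> 128 bytes). */
#define PREFETCH_UNIT 64

/* Overridable at build time, e.g. with -DPREFETCH_DISTANCE=2. */
#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 3
#endif

/* Byte offset ahead of the current source position that is prefetched. */
static inline size_t prefetch_offset(unsigned distance)
{
    return (size_t)distance * PREFETCH_UNIT;
}

/* Hypothetical illustration only: copy n bytes in PREFETCH_UNIT-sized
 * chunks, hinting the source data that lies PREFETCH_DISTANCE units
 * ahead of the chunk currently being copied. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t ahead = prefetch_offset(PREFETCH_DISTANCE);

    while (n >= PREFETCH_UNIT) {
        /* Only prefetch while the target address is still inside the
         * source buffer; near the end there is nothing left to fetch. */
        if (n > ahead)
            __builtin_prefetch(s + ahead);

        for (size_t i = 0; i < PREFETCH_UNIT; i++)
            d[i] = s[i];

        d += PREFETCH_UNIT;
        s += PREFETCH_UNIT;
        n -= PREFETCH_UNIT;
    }

    /* Copy the remaining tail of fewer than PREFETCH_UNIT bytes. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

A smaller distance means each hint is issued closer to the point of use,
so whether 2 or 3 works better depends on the platform's memory latency
relative to the copy loop, which is what the benchmark runs would show.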