Call for testing/opinions: Optimized memset/memcpy

Harm Hanemaaijer fgenfb at yahoo.com
Sun Jul 14 07:00:50 EDT 2013


Willy Tarreau <w <at> 1wt.eu> writes:

> 
> Please find the results attached. It seems that memcpy improved by 0.8%
> though that's not even certain.
> 

What is interesting is that ARM's documentation at
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Caccifbd.html
and several other sources (such as other optimized memcpy implementations)
document the cache line size of the Cortex A9 as 32 bytes, which is an
anomaly in the armv7 family. However, it looks like the kernel is defining
L1_CACHE_BYTES as 64 (L1_CACHE_SHIFT == 6) for all armv7 platforms, which
looks like a serious configuration error for Cortex A9.

This would explain why the large-size memcpy results that you posted are not
optimal, and also the below-par copy_page performance in the current kernel
implementation: copy_page uses L1_CACHE_BYTES to determine the preload
strategy, while the current memcpy doesn't (it is hardcoded for a cache line
size of 32 bytes).
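
To illustrate the preload point, here is a minimal model in C. It is not the
kernel's copy_page code; the function name lines_preloaded() and its
parameters are made up for this sketch. It assumes a copy loop that issues one
preload every `stride` bytes and counts how many distinct hardware cache lines
those preloads actually touch.

```c
#include <stddef.h>

/*
 * Hypothetical model, not kernel code: count how many distinct hardware
 * cache lines of a buffer are touched when one preload is issued every
 * `stride` bytes, starting at offset 0.
 *
 *   buf_bytes - size of the region being copied (e.g. a 4096-byte page)
 *   stride    - distance between preloads (what the code assumes the
 *               cache line size to be, e.g. L1_CACHE_BYTES)
 *   hw_line   - the real cache line size of the CPU
 */
size_t lines_preloaded(size_t buf_bytes, size_t stride, size_t hw_line)
{
    size_t count = 0;
    size_t last_line = (size_t)-1;   /* sentinel: no line seen yet */
    size_t off;

    if (stride == 0 || hw_line == 0)
        return 0;                    /* avoid division by zero / endless loop */

    for (off = 0; off < buf_bytes; off += stride) {
        size_t line = off / hw_line; /* index of the line this preload hits */

        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

In this model, a 4096-byte page on a CPU with 32-byte lines has 128 lines.
With stride 64, lines_preloaded(4096, 64, 32) covers only 64 of them, so every
other line is never preloaded. With stride 32 all 128 are covered. Whether the
skipped lines cost as much in practice as the model suggests depends on the
hardware, so this only shows the mismatch, not its measured impact.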

This merits further investigation, and there may be other kernel issues for
Cortex A9 (including performance issues) related to this.

To confirm, does running 'zcat /proc/config.gz | grep L1_CACHE_SHIFT' on a
Cortex A9 show CONFIG_ARM_L1_CACHE_SHIFT defined as 6?
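
For reference, the kernel derives L1_CACHE_BYTES as 1 << L1_CACHE_SHIFT, so a
shift of 6 means 64 bytes and a shift of 5 means 32 bytes. A small sketch in C
that turns a line of the grep output into the implied line size might look
like this; the helper name l1_bytes_from_config_line() is hypothetical and
only meant for illustration.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse one line of kernel config text and return
 * the cache line size in bytes implied by CONFIG_ARM_L1_CACHE_SHIFT.
 * Returns 0 if the line does not set that option or the value is not
 * a usable shift count.
 */
unsigned long l1_bytes_from_config_line(const char *line)
{
    unsigned int shift;

    /* The literal part of the format must match exactly, including '='. */
    if (sscanf(line, "CONFIG_ARM_L1_CACHE_SHIFT=%u", &shift) != 1)
        return 0;

    /* Reject shift counts that would overflow unsigned long. */
    if (shift >= 8 * sizeof(unsigned long))
        return 0;

    return 1UL << shift;             /* same relation as L1_CACHE_BYTES */
}
```

Fed the line "CONFIG_ARM_L1_CACHE_SHIFT=6", this returns 64, which is the
value that would conflict with the documented 32-byte line of the Cortex A9.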





More information about the linux-arm-kernel mailing list