RFC: Dynamic hwcaps

Dave Martin dave.martin at linaro.org
Tue Dec 7 05:45:42 EST 2010


On Tue, Dec 7, 2010 at 1:02 AM, Mark Mitchell <mark at codesourcery.com> wrote:
> On 12/6/2010 5:07 AM, Dave Martin wrote:
>>> But,
>>> to enable binary distribution, having to have N copies of a library (let
>>> alone an application) for N different ARM core variants just doesn't
>>> make sense to me.
>> Just so, and as discussed before improvements to package managers
>> could help here to avoid installing duplicate libraries.  (I believe
>> that rpm may have some capability here (?) but deb does not at
>> present).
> Yes, a smarter package manager could help a device builder automatically
> get the right version of a library.  But, something more fundamental has
> to happen to avoid the library developer having to *produce* N versions
> of a library.  (Yes, in theory, you just type "make" with different
> CFLAGS options, but in practice of course it's often more complex than
> that, especially if you need to validate the library.)

Yes-- though I didn't elaborate on it.  You need a packager that can
understand, say, that a binary built for ARMv5 EABI can interoperate
with ARMv7 binaries etc.
Again, I've heard it suggested that RPM can handle this, but I haven't
looked at it in detail myself.

>> Currently, I don't have many examples-- the main one is related to the
>> discussions around using NEON for memcpy().  This can be a performance
>> win on some platforms, but unless the system is heavily loaded, or
>> NEON happens to be turned on anyway, it may not be advantageous for
>> the user or for overall system performance.
> How good of a proxy would the length of the copy be, do you think?  If
> you want to copy 1G of data, and NEON makes you 2x-4x faster, then it
> seems to me that you probably want to use NEON, almost independent of
> overall system load.  But, if you're only going to copy 16 bytes, even
> if NEON is faster, it's probably OK not to use it -- the function-call
> overhead to get into memcpy at all is probably significant relative to
> the time you'd save by using NEON.  In between, it's harder, of course
> -- but perhaps if memcpy is the key example, we could get 80% of the
> benefit of your idea simply by a test inside memcpy as to the length of
> the data to be copied?

For the memcpy() case, the answer is probably yes, though how often
memcpy is called by a given thread is also of significance.
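A length test inside memcpy could be sketched roughly as below. This is only an illustration: the threshold value, the function names, and the stand-in implementations are all invented here, and a real NEON variant would use NEON load/store instructions rather than plain memcpy().

```c
/* Hypothetical sketch: dispatch memcpy() on copy length.
 * NEON_THRESHOLD is an invented tuning knob, not a real glibc value;
 * in practice it would have to be benchmarked per SoC. */
#include <stddef.h>
#include <string.h>

#define NEON_THRESHOLD 1024  /* bytes */

/* Stand-in for the plain integer-core implementation. */
static void *memcpy_scalar(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Stand-in for a NEON implementation; a real one would use
 * NEON vector loads and stores. */
static void *memcpy_neon(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

void *memcpy_dispatch(void *dst, const void *src, size_t n)
{
    /* Small copies: call overhead dominates, and taking the NEON
     * power-up / context cost is unlikely to pay off. */
    if (n < NEON_THRESHOLD)
        return memcpy_scalar(dst, src, n);
    return memcpy_neon(dst, src, n);
}
```

The hard part, as noted above, is that the right threshold (and whether NEON wins at all) varies from core to core and SoC to SoC.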

However, there's still a problem: NEON is not designed for
implementing memcpy(), so there's no guarantee that it will always be
faster ... it is on some SoCs in some situations, but much less
beneficial on others -- the "sweet spots" both for performance and
power may differ widely from core to core and from SoC to SoC.  So
running benchmarks on one or two boards and then hard-compiling some
thresholds into glibc may not be the right approach.  Also, gcc
implements memcpy directly too in some cases (but only for small,
constant-size copies that it can expand inline).

The dynamic hwcaps approach doesn't really solve that problem: for
adapting to different SoCs, you really want a way to run a benchmark
on the target to make your decision (xine-lib chooses an internal
memcpy implementation this way for example), or a way to pass some
platform metrics to glibc / other affected libraries.  Identifying the
precise SoC from /proc/cpuinfo isn't always straightforward, but I've
seen some code making use of it in similar ways.


More information about the linux-arm-kernel mailing list