[PATCH 0/4] arm64: advertise availability of CRC and crypto instructions

Thu Dec 19 12:33:45 EST 2013

On 19 December 2013 07:48, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> On Wed, 18 Dec 2013 22:57:33 +0100
> Ard Biesheuvel <ard.biesheuvel at linaro.org> wrote:
>
>> On 18 December 2013 22:18, Nicolas Pitre <nicolas.pitre at linaro.org> wrote:
>> > On Wed, 18 Dec 2013, Ard Biesheuvel wrote:
>> >> The nice thing about hwcaps is that it is already integrated into the
>> >> ifunc resolution done by the loader, which makes it very easy and
>> >> straightforward to offer alternative implementations of library
>> >> functions based on CPU capabilities.
>> >
>> > The library may as well implement its own ifunc that tests the
>> > instruction while trapping SIGILL.  On those systems with the supported
>> > instruction there will be no trap.  On those that trap, the
>> > alternative implementation is going to be much slower anyway.
>> >
>>
>> True. And the trap still only occurs at load time. But I think we
>> agree it is essentially a poor man's hwcaps.
>
> And the hwcaps is essentially a poor man's replacement for a userspace
> accessible CPUID instruction enjoyed by x86.
>
> It's sad to see that the runtime CPU features detection still remains
> a PITA with AArch64. Basically, it's not enough to know if the
> instruction is supported or not. Different microarchitectures may have
> various performance quirks for certain instructions. For example,
> VFPLite in Cortex-A8 is non-pipelined and slow. Cortex-A15 can
> dual-issue NEON instructions (nice for the code which can enjoy
> high ILP), but Cortex-A15 NEON instructions have relatively high
> latency (bad for the code, which is essentially a long dependency
> chain). The fastest way to read uncached memory for most ARM
> processors is to use the VFP load multiple instruction with as
> many registers as possible, but this is slow on Marvell PJ4. And
> so on.
>

You are comparing apples and oranges.

It is fairly well known that you are better off using NEON for
floating point on a Cortex-A8, if you can afford the reduced
precision. But if you /can't/ afford the reduced precision, you are
still better off using VFP-lite than using software emulation.

The same applies to the Crypto Extensions: it is highly unlikely that
you will care about the particular implementation of the AES
instructions if you are faced with the choice of using those
instructions or using a software implementation. So using hwcaps bits
for these kinds of features makes perfect sense. (And so does enabling
the 'has-vfp' bit for VFP-lite.)
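
To make the point concrete, here is a minimal sketch of a presence-only
check from userspace. It assumes glibc's getauxval() (available since
glibc 2.16) and an AES bit at position 3, which is where HWCAP_AES sits
in the arm64 <asm/hwcap.h> of current kernels; that value is not taken
from this thread, and the names SKETCH_HWCAP_AES, choose_aes_impl and
detect_aes_impl are made up for illustration.

```c
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP (glibc >= 2.16) */

/* Assumed bit position: matches HWCAP_AES in the arm64 uapi header of
 * current kernels. Only meaningful when running on AArch64. */
#define SKETCH_HWCAP_AES (1UL << 3)

typedef enum {
    AES_IMPL_SOFTWARE,    /* portable C fallback */
    AES_IMPL_CRYPTO_EXT   /* ARMv8 Crypto Extensions */
} aes_impl;

/* Pure decision on the hwcap word: only the presence or absence of the
 * extension matters here, nothing about the microarchitecture. */
aes_impl choose_aes_impl(unsigned long hwcap)
{
    return (hwcap & SKETCH_HWCAP_AES) ? AES_IMPL_CRYPTO_EXT
                                      : AES_IMPL_SOFTWARE;
}

/* Reads the hwcap word the kernel placed in the auxiliary vector. */
aes_impl detect_aes_impl(void)
{
    return choose_aes_impl(getauxval(AT_HWCAP));
}
```

An ifunc resolver could make the same decision once at load time; either
way, no knowledge of the specific core is needed to pick the hardware
path over the software one.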

I do agree with you that the heterogeneity between various ARM
implementors is a PITA at times, and asking exactly which CPU you are
running on is a valid question in those cases (btw this applies to SSE
on Atom as well).
But please don't confuse it with the simple presence or absence of
some CPU extension.

Regards,
Ard.