[PATCH 0/4] arm64: advertise availability of CRC and crypto instructions

Siarhei Siamashka siarhei.siamashka at gmail.com
Thu Dec 19 20:35:51 EST 2013


On Thu, 19 Dec 2013 18:33:45 +0100
Ard Biesheuvel <ard.biesheuvel at linaro.org> wrote:

> On 19 December 2013 07:48, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
> > On Wed, 18 Dec 2013 22:57:33 +0100
> > Ard Biesheuvel <ard.biesheuvel at linaro.org> wrote:
> >
> >> On 18 December 2013 22:18, Nicolas Pitre <nicolas.pitre at linaro.org> wrote:
> >> > On Wed, 18 Dec 2013, Ard Biesheuvel wrote:
> >> >> The nice thing about hwcaps is that it is already integrated into the
> >> >> ifunc resolution done by the loader, which makes it very easy and
> >> >> straightforward to offer alternative implementations of library
> >> >> functions based on CPU capabilities.
> >> >
> >> > The library may as well implement its own ifunc that tests the
> >> > instruction while trapping SIGILL.  On those systems with the supported
> >> > instruction there will be no trap.  On those that do trap, the
> >> > alternative implementation is going to be much slower anyway.
> >> >
> >>
> >> True. And the trap still only occurs at load time. But I think we
> >> agree it is essentially a poor man's hwcaps.
> >
> > And the hwcaps is essentially a poor man's replacement for a userspace
> > accessible CPUID instruction enjoyed by x86.
> >
> > It's sad to see that the runtime CPU features detection still remains
> > a PITA with AArch64. Basically, it's not enough to know if the
> > instruction is supported or not. Different microarchitectures may
> > have various performance quirks for certain instructions. For example,
> > VFPLite in Cortex-A8 is non-pipelined and slow. Cortex-A15 can
> > dual-issue NEON instructions (nice for code with high ILP), but
> > Cortex-A15 NEON instructions have relatively high latency (bad for
> > code that is essentially a long dependency chain). The fastest way
> > to read uncached memory for most ARM
> > processors is to use the VFP load multiple instruction with as
> > many registers as possible, but this is slow on Marvell PJ4. And
> > so on.
> >
> 
> You are comparing apples and oranges.
> 
> It is fairly well known that you are better off using the NEON for
> floating point on a Cortex-A8, if you can afford the reduced
> precision. But if you /can't/ afford the reduced precision, you are
> still better off using VFP-lite than using software emulation.

If the reduced precision of 32-bit floats can't be afforded, it is still
sometimes possible to use more accurate fixed-point calculations
instead, and to do them faster than with VFP-lite. The generic and slow
software emulation of 64-bit doubles is not even an option.

That's exactly the point. If we know more about the CPU and its
capabilities, we can select a more suitable implementation at runtime,
even one that uses a somewhat different algorithm for doing the same
job.
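
As a rough sketch of this kind of runtime selection: the two routines
below do the same job (scaling by 0.75), one in single-precision
floating point and one in Q16 fixed point, and a selector picks between
them once. CAP_FAST_FLOAT, scale_float(), scale_fixed() and
select_scale() are hypothetical names made up for illustration, not any
existing kernel or libc interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bit; not a real hwcap value. */
#define CAP_FAST_FLOAT (1u << 0)

typedef uint32_t (*scale_fn)(uint32_t x);

/* Scale by 0.75 using single-precision floating point. */
static uint32_t scale_float(uint32_t x)
{
    return (uint32_t)((float)x * 0.75f);
}

/* Same job, different algorithm: Q16 fixed point (0.75 == 49152/65536). */
static uint32_t scale_fixed(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 49152u) >> 16);
}

/* Pick an implementation once, based on what is known about the CPU. */
static scale_fn select_scale(unsigned int caps)
{
    scale_fn fn = (caps & CAP_FAST_FLOAT) ? scale_float : scale_fixed;

    assert(fn != 0);
    return fn;
}
```

A caller would resolve the pointer once at startup, for example
"scale_fn scale = select_scale(caps);", and call through it afterwards.
The interesting part is where "caps" comes from, which is what this
thread is about.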

> The same applies to the Crypto Extensions: it is highly unlikely that
> you will care about the particular implementation of the AES
> instructions if you are faced with the choice of using those
> instructions or using a software implementation. So using hwcaps bits
> for these kinds of features makes perfect sense. (And so does enabling
> the 'has-vfp' bit for VFP-lite)

I'm not opposing the addition of Crypto Extensions support to hwcaps.
Still, this covers only the basic use cases (which is great!) and is
not enough to make everyone happy.
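
For the basic use case, a userspace check could look roughly like the
sketch below. It assumes the AES bit is exposed as HWCAP_AES through
<asm/hwcap.h> and read with getauxval(AT_HWCAP); hwcap_has() and
cpu_has_aes() are hypothetical helper names. On anything other than
arm64 Linux the sketch simply reports "not available".

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_* bits (assumed to include HWCAP_AES) */
#endif

/* Test a single capability bit in a hwcaps word. */
static bool hwcap_has(unsigned long hwcaps, unsigned long bit)
{
    assert(bit != 0);
    return (hwcaps & bit) != 0;
}

/* Report whether the kernel advertises the AES instructions. */
static bool cpu_has_aes(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_AES)
    return hwcap_has(getauxval(AT_HWCAP), HWCAP_AES);
#else
    return false;   /* unknown platform: assume the feature is absent */
#endif
}
```

Note that this answers only "is the instruction there?". It says
nothing about which core is executing it or how fast it is.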

> I do agree with you that the heterogeneity between various ARM
> implementors is a PITA at times, and knowing which CPU exactly you are
> running on is a valid question in those cases

Again, this was exactly the point of my e-mail. And it appears that we
agree with each other.

> (btw this applies to SSE on Atom as well).

I'm well aware of the Atom SSSE3 performance issues (the microcoded
PSHUFB instruction in particular). The key difference is that the x86
architecture makes it easy to identify the CPU cores.
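
To illustrate, a minimal sketch of core identification from userspace
on x86, assuming a GCC or Clang toolchain that ships <cpuid.h> with
__get_cpuid(). The struct x86_id type and the decode_signature() and
read_x86_id() helpers are hypothetical names; the bit layout is the
usual family/model encoding of CPUID leaf 1 EAX.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid() */
#endif

struct x86_id {
    unsigned int family;
    unsigned int model;
};

/* Decode display family/model from the CPUID leaf 1 EAX signature. */
static struct x86_id decode_signature(uint32_t eax)
{
    unsigned int base_family = (eax >> 8) & 0xf;
    unsigned int base_model = (eax >> 4) & 0xf;
    unsigned int ext_model = (eax >> 16) & 0xf;
    unsigned int ext_family = (eax >> 20) & 0xff;
    struct x86_id id;

    id.family = base_family;
    id.model = base_model;
    if (base_family == 0xf)
        id.family += ext_family;
    if (base_family == 0x6 || base_family == 0xf)
        id.model |= ext_model << 4;
    return id;
}

/* Returns 1 and fills *out on x86, 0 where CPUID is not available. */
static int read_x86_id(struct x86_id *out)
{
    assert(out != 0);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    *out = decode_signature(eax);
    return 1;
#else
    return 0;
#endif
}
```

For example, the signature 0x000106C2 decodes to family 6, model 0x1C,
which falls in the first-generation Atom range, so a library could key
a PSHUFB-free code path on that result without any help from the
kernel.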

> But please don't confuse it with the simple presence or absence of
> some CPU extension.

...

-- 
Best regards,
Siarhei Siamashka


