[PATCH] arm64: Add support for Half precision floating point

Adhemerval Zanella adhemerval.zanella at linaro.org
Tue Feb 2 10:28:02 PST 2016



On 02-02-2016 16:25, Szabolcs Nagy wrote:
> On 02/02/16 18:12, Adhemerval Zanella wrote:
>> On 02-02-2016 15:31, Szabolcs Nagy wrote:
>>> On 28/01/16 16:51, Adhemerval Zanella wrote:
>>>> On 28-01-2016 14:07, Will Deacon wrote:
>>>>> On Tue, Jan 26, 2016 at 10:25:38PM +0530, Siddhesh Poyarekar wrote:
>>>>>> Adding Adhemerval to cc since he had volunteered to follow up on this,
>>>>>> mainly because he had a couple of additional ideas on the kernel
>>>>>> front.
>>>>>>
>>>>>> On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
>>>>>>> On 26/01/16 16:02, Will Deacon wrote:
>>>>>>>> Hi Suzuki,
>>>>>>>>
>>>>>>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>>>>>>> ARMv8.2 extensions [1] include an optional feature which adds
>>>>>>>>> half-precision (16-bit) floating point/asimd data processing
>>>>>>>>> instructions. This patch adds support for detecting the feature
>>>>>>>>> and exposing it to userspace via HWCAPs.
>>>>>>>
>>>>>>>
>>>>>>>>> +#define HWCAP_FPHP		(1 << 9)
>>>>>>>>> +#define HWCAP_ASIMDHP		(1 << 10)
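>>>>>>>>>
>>>>>>>>> Userspace can then test these bits via getauxval(3); a minimal
>>>>>>>>> sketch (the bit values are copied from the hunk above, in case
>>>>>>>>> the installed <asm/hwcap.h> does not define them yet):
>>>>>>>>>
>>>>>>>>>   #include <stdio.h>
>>>>>>>>>   #include <sys/auxv.h>
>>>>>>>>>
>>>>>>>>>   #define HWCAP_FPHP    (1 << 9)
>>>>>>>>>   #define HWCAP_ASIMDHP (1 << 10)
>>>>>>>>>
>>>>>>>>>   int main (void)
>>>>>>>>>   {
>>>>>>>>>     unsigned long hwcap = getauxval (AT_HWCAP);
>>>>>>>>>     printf ("fphp: %d asimdhp: %d\n",
>>>>>>>>>             !!(hwcap & HWCAP_FPHP), !!(hwcap & HWCAP_ASIMDHP));
>>>>>>>>>     return 0;
>>>>>>>>>   }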
>>>>>>>>
>>>>>>>> Where did we get to with the mrs trapping you proposed here?
>>>>>>>>
>>>>>>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>>>>>
>>>>>>> We have yet to get feedback from the glibc/gcc folks. Siddhesh was looking
>>>>>>> to make use of it [2], but we haven't heard anything back. Ramana mentioned
>>>>>>> (in private) that they had plans to take a look at it.
>>>>>>
>>>>>> I believe one of Adhemerval's ideas was similar to what I had
>>>>>> mentioned back then, which was to provide all of the CPU information
>>>>>> in a single file instead of having to traverse a directory structure.
>>>>>
>>>>> My understanding was that libc needed this information extremely early
>>>>> on (i.e. before it could even issue system calls), and therefore such
>>>>> an approach would be in addition to the proposal here. Am I mistaken?
>>>>
>>>> If the idea is to use these instructions for function implementation
>>>> selection (iFUNC), the selection happens at PLT resolution, either by
>>>> accessing the information directly or through a caching mechanism.
>>>> x86_64 does something similar with cacheline information: it issues a
>>>> single cpuid and creates a processor information table based on its
>>>> result (which is also what __builtin_cpu_supports() does).
>>>>
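>>>> A rough sketch of that kind of selection (foo_fp16/foo_generic are
>>>> made-up variants; on aarch64 glibc passes the AT_HWCAP value as the
>>>> resolver argument):
>>>>
>>>>   #include <stdint.h>
>>>>
>>>>   #define HWCAP_FPHP (1 << 9)
>>>>
>>>>   void foo_fp16 (void)    { /* fp16 variant */ }
>>>>   void foo_generic (void) { /* fallback */ }
>>>>
>>>>   /* Run by the dynamic loader when the PLT entry is resolved.  */
>>>>   static void (*foo_resolver (uint64_t hwcap)) (void)
>>>>   {
>>>>     return (hwcap & HWCAP_FPHP) ? foo_fp16 : foo_generic;
>>>>   }
>>>>
>>>>   void foo (void) __attribute__ ((ifunc ("foo_resolver")));
>>>>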
>>>
>>> __builtin_cpu_supports is not a single cpuid on x86, it is
>>> a cpuid per dso with one cache per dso.
>>>
>>> (gcc-5 used a single cache in libgcc_s.so.1 and that
>>> turned out to be broken because ifunc in other dsos
>>> could not reliably access it.)
>>
>> It is with static libgcc (the default), but if you use -shared-libgcc only
>> one __cpu_model (used by __builtin_cpu_supports) will be linked.  But since
>> static libgcc is the default, it will indeed be one per DSO.
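>>
>> For instance, each DSO containing something like the following carries
>> its own __cpu_model copy (initialized via cpuid by a libgcc
>> constructor) when linked against the default static libgcc:
>>
>>   /* x86-only sketch; gcc expands the builtin to a read of the
>>      per-DSO __cpu_model data.  */
>>   int
>>   use_avx2 (void)
>>   {
>>     return __builtin_cpu_supports ("avx2");
>>   }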
> 
> with shared libgcc x86 fmv is broken, the ifunc
> resolver may run before libgcc gets relocated.
> 
> fwiw shared libgcc is also broken on arm with old kernels.
> (because it aborts if 64bit atomics are not supported:
> the check assumes it only gets linked in if user code
> uses 64bit atomics, but with shared libgcc the check
> is always done.)
> 
> so i don't think shared libgcc is well supported...
> 
>>>>>> The other idea was to add a vDSO function that returns this data so as
>>>>>> to avoid (or at least reduce) the context switch latency.
>>>>>
>>>>> I'm not at all keen on adding a data ABI to the vDSO. I think people tried
>>>>> similar things in the past (something on PPC?) and have horror stories
>>>>> from that.
>>>>
>>>> In fact ppc still exports it in the vDSO (include/asm/vdso_datapage.h), with
>>>> information like the LPAR cfg, platform, processor, {d,i}cache, etc.
>>>> I recall having seen some code back at IBM that tried to use these
>>>> fields directly, but indeed it is not recommended.
>>>>
>>>> What I have in mind is something like what ppc does with __kernel_get_syscall_map.
>>>> It is a vDSO function that returns vDSO-internal data indicating which
>>>> syscalls are implemented in the running kernel (through a bitmap field).
>>>>
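>>>> An aarch64 analogue could be a __kernel_get_cpu_features-style call.
>>>> A hypothetical sketch of the consumer side (both names below are made
>>>> up; lookup_vdso_sym stands in for the libc-internal vDSO symbol
>>>> lookup done through AT_SYSINFO_EHDR):
>>>>
>>>>   #include <stdint.h>
>>>>   #include <sys/auxv.h>
>>>>
>>>>   /* Not a real API: placeholder for the internal lookup.  */
>>>>   extern void *lookup_vdso_sym (const char *name);
>>>>
>>>>   static uint64_t
>>>>   get_cpu_features (void)
>>>>   {
>>>>     uint64_t (*fn) (void)
>>>>       = (uint64_t (*) (void)) lookup_vdso_sym ("__kernel_get_cpu_features");
>>>>     /* Fall back to the auxv bits if the kernel lacks the entry.  */
>>>>     return fn != NULL ? fn () : (uint64_t) getauxval (AT_HWCAP);
>>>>   }
>>>>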
>>>
>>> neither fs access nor the vdso works for ifunc based dispatch
>>> (assuming the current ifunc implementation in glibc).
>>>
>>> (for vdso you need the AT_SYSINFO_EHDR auxval somehow and
>>> then implement elf symbol lookup in the ifunc resolver
>>> without calling any libc function. passing auxvals to the
>>> ifunc resolver can be done by changing the ifunc abi, but
>>> doing symbol lookups there is unrealistic.)
>>>
>>> in the libc (e.g. for memcpy) ifunc is a bit easier to use,
>>> but in user code (function-multi-versioning) ifunc is very
>>> limited.
>>>
>>> i wrote about the ifunc limitations here:
>>> https://sourceware.org/ml/libc-alpha/2015-11/msg00108.html
>>> see point (4) and (5).
>>>
>>
>> I recall this thread and indeed iFUNC has a set of limitations, although for
>> use within libc itself it might be safe given the constraints you have described.
>>
>> Now for vDSO usage, I think it might be safe to use within GLIBC
>> with the correct vDSO pointer initialization order. At least it is done
>> in GLIBC for gettimeofday on x86_64 and powerpc (the iFUNC returns
>> the vDSO function pointer).
>>
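>> Roughly, the scheme looks like this (vdso_gettimeofday stands in for
>> the pointer glibc caches during vDSO setup; the syscall variant is the
>> fallback, and my_gettimeofday plays the role of the exported symbol):
>>
>>   #include <sys/time.h>
>>
>>   /* Placeholders for the libc-internal symbols.  */
>>   extern int (*vdso_gettimeofday) (struct timeval *, struct timezone *);
>>   extern int syscall_gettimeofday (struct timeval *, struct timezone *);
>>
>>   /* The resolver runs at relocation time and picks the vDSO entry
>>      point if it has already been looked up.  */
>>   static int (*gettimeofday_resolver (void))
>>     (struct timeval *, struct timezone *)
>>   {
>>     return vdso_gettimeofday != NULL
>>            ? vdso_gettimeofday : syscall_gettimeofday;
>>   }
>>
>>   int my_gettimeofday (struct timeval *tv, struct timezone *tz)
>>     __attribute__ ((ifunc ("gettimeofday_resolver")));
>>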
> 
> i don't see how that can work with static linking.
> (vdso setup happens after ifunc resolvers are run)

Direct syscalls are used for the static case. I haven't yet dug into exactly
why vDSO setup happens after ifunc resolution, or whether it is possible to
change that to enable this for static linking as well.
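
The static-case fallback boils down to a direct syscall, roughly:

  #include <sys/syscall.h>
  #include <sys/time.h>
  #include <unistd.h>

  /* Sketch of what is used when no vDSO pointer is available,
     e.g. in static binaries.  */
  static int
  syscall_gettimeofday (struct timeval *tv, struct timezone *tz)
  {
    return syscall (SYS_gettimeofday, tv, tz);
  }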


