[RFC v2 PATCH 2/4] ARM64: add support for kernel mode NEON in atomic context
Catalin Marinas
catalin.marinas at arm.com
Sun Oct 13 18:48:40 EDT 2013
On 11 Oct 2013, at 21:09, Nicolas Pitre <nicolas.pitre at linaro.org> wrote:
> On Fri, 11 Oct 2013, Catalin Marinas wrote:
>> On 11 Oct 2013, at 18:30, Ard Biesheuvel <ard.biesheuvel at linaro.org> wrote:
>>>> On 11 Oct 2013, at 19:14, Catalin Marinas <catalin.marinas at arm.com> wrote:
>>>>> On Wed, Oct 09, 2013 at 07:50:32PM +0100, Ard Biesheuvel wrote:
>>>>> --- a/arch/arm64/include/asm/neon.h
>>>>> +++ b/arch/arm64/include/asm/neon.h
>>>>> @@ -8,7 +8,38 @@
>>>>> * published by the Free Software Foundation.
>>>>> */
>>>>>
>>>>> +#include <linux/hardirq.h>
>>>>> +#include <linux/types.h>
>>>>> +#include <asm/fpsimd.h>
>>>>> +
>>>>> #define cpu_has_neon() (1)
>>>>>
>>>>> +#define DEFINE_NEON_STACK_REGS(a, num) \
>>>>> + struct { \
>>>>> + struct fpsimd_partial_state regs; \
>>>>> + __uint128_t vregs[(num) > 32 ? 32 : ((num) + 1) & ~1U]; \
>>>>> + } a = { .regs.num_regs = sizeof(a.vregs) / sizeof(__uint128_t) }
>>>>> +
>>>>> +#define DEFINE_NEON_STACK_REGS_ALL(name) DEFINE_NEON_STACK_REGS(name, 32)
>>>>> +
>>>>> void kernel_neon_begin(void);
>>>>> void kernel_neon_end(void);
>>>>> +
>>>>> +static inline void __kernel_neon_begin_atomic(struct fpsimd_partial_state *regs)
>>>>> +{
>>>>> + if (!in_interrupt())
>>>>> + kernel_neon_begin();
>>>>> + else
>>>>> + fpsimd_save_partial_state(regs);
>>>>> +}
>>>>> +
>>>>> +static inline void __kernel_neon_end_atomic(struct fpsimd_partial_state *regs)
>>>>> +{
>>>>> + if (!in_interrupt())
>>>>> + kernel_neon_end();
>>>>> + else
>>>>> + fpsimd_load_partial_state(regs);
>>>>> +}
>>>>
>>>> The _atomic suffix is a bit misleading (you basically mean no user
>>>> context). I wonder whether it's better to have some _fast/_slow variants
>>>> instead. Looking at the other two patches, you only need 2 or 4
>>>> registers to do the crypto stuff but if you are not in_interrupt(), you
>>>> basically save and restore the full NEON bank. I would say for such
>>>> cases just make a kernel_neon_begin_fast() call which is safe in all
>>>> contexts and much faster.
>>>
>>> I agree that the name is a bit misleading.
>>>
>>> Regarding fast/slow: if you take the core aes cipher as an example, it
>>> will likely be called from a loop somewhere, and (assuming lazy restore
>>> gets merged at some point) you may be stacking and unstacking 2 or 4
>>> registers many times while kernel_neon_begin() would just stack them
>>> once and let the lazy restore unstack them only when needed.
>>>
>>> This is probably a detail where arm and arm64 will be implemented
>>> somewhat differently. I would still like to align the api between the
>>> two, if possible, so intrinsics or gcc vectorized code can be shared
>>> easily. However, as ARM has fewer use cases where using only 2
>>> registers makes sense (memcpy perhaps?), and already has lazy restore
>>> wired up, I personally feel hooking into the lazy restore in the
>>> !in_interrupt() case is still the best solution there.
>>
>> Lazy saving/restoring on context switch may not be beneficial on arm64
>> and I don't want to enable it until I see some real user space
>> benchmarks (not kernel crypto API benchmarks).
>
> I think it is more important to establish the API semantics here.
> Implementation may vary afterwards.
>
> The difference right now between kernel_neon_begin() and
> __kernel_neon_begin_atomic() is that the latter can be stacked while the
> former cannot.
How much stacking do we need? If we limit the nesting to two levels
(process and IRQ context), we could pre-allocate per-CPU
fpsimd_state structures for interrupt context and always use the same
API. As for softirqs, do we need another level of nesting?
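For the two-level case, the (untested) sketch below is roughly what I have
in mind; the per-CPU variable and the *_any() wrapper names are made up,
just to illustrate the idea:

#include <linux/percpu.h>
#include <linux/hardirq.h>
#include <asm/fpsimd.h>
#include <asm/neon.h>

/* pre-allocated save area for the in_interrupt() case, one per CPU */
static DEFINE_PER_CPU(struct fpsimd_state, irq_fpsimd_state);

static void kernel_neon_begin_any(void)
{
	if (!in_interrupt()) {
		kernel_neon_begin();
	} else {
		/* IRQ context: save the interrupted NEON state per CPU */
		fpsimd_save_state(this_cpu_ptr(&irq_fpsimd_state));
	}
}

static void kernel_neon_end_any(void)
{
	if (!in_interrupt())
		kernel_neon_end();
	else
		fpsimd_load_state(this_cpu_ptr(&irq_fpsimd_state));
}

A third pre-allocated area (or a small array indexed by the nesting level)
would cover softirq nesting under hardirq, if that turns out to be needed.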
> Maybe the API should be kernel_neon_begin() and
> kernel_neon_begin_partial(nb_regs), the former being a simple alias to
> the latter with the full register set as argument. And then the actual
> register saving method (whether it is an atomic context or not, the
> number of registers, etc.) could be handled and optimized internally
> instead of exposing such implementation constraints to users of the API.
It could be more efficient to always specify the number of registers to
be saved/restored even for kernel_neon_begin(). But I haven't paid much
attention to the register use in the actual crypto algorithms.
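To make that concrete, the interface Nicolas suggests could look roughly
like the sketch below (prototype, macro and caller are illustrative only,
nothing I have implemented or benchmarked):

#include <linux/types.h>
#include <asm/neon.h>

/* save/restore only as many NEON registers as the caller declares */
void kernel_neon_begin_partial(u32 num_regs);
#define kernel_neon_begin()	kernel_neon_begin_partial(32)

/* e.g. a cipher that only clobbers q0-q3 */
static void hypothetical_aes_encrypt_block(u8 *out, const u8 *in)
{
	kernel_neon_begin_partial(4);
	/* ... NEON code touching at most four quad registers ... */
	kernel_neon_end();
}

Whether the implementation then saves a partial bank on the stack, the full
bank lazily, or to a pre-allocated per-CPU area becomes an internal detail,
as you say.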
Catalin