[PATCH] ARM: document the use of NEON in kernel mode

Sat Aug 10 10:09:25 EDT 2013

On 10 August 2013 15:34, Domenico Andreoli <cavokz at gmail.com> wrote:
> Hi Ard!
>

Ciao!

> On Fri, Aug 09, 2013 at 02:21:09PM +0200, Ard Biesheuvel wrote:

[...]

>> +If latency is a concern, it is possible to put back to back calls to
>> +kernel_neon_end() and kernel_neon_begin() in places in your code where none of
>> +the NEON registers are live. (Additional calls to kernel_neon_begin() should be
>> +reasonably cheap if no context switch occurred in the meantime)
>
> I don't understand. How latency concerns could be relieved with scattering
> non-NEON code with calls to kernel_neon_end() and kernel_neon_begin()?
>

The idea is that something like

kernel_neon_begin();
... use the NEON for a very long time ...
kernel_neon_end();

can in some cases be changed to

kernel_neon_begin();
... use the NEON for not so long a time ...
kernel_neon_end();
kernel_neon_begin();
... use the NEON for not so long a time ...
kernel_neon_end();

Note that kernel_neon_end() re-enables preemption (in the
CONFIG_PREEMPT case), which triggers a context switch if a higher
priority task is pending.

The point I am trying to make is that
a) the second call to kernel_neon_begin() is not as costly as the
first one if no context switch occurred in between, and
b) you should only do this in places where clobbering the NEON
registers is allowable (i.e., no NEON registers are live).
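
To make the pattern concrete, here is a rough sketch that processes a
buffer in bounded chunks, with one begin/end pair per chunk. It is a
userspace illustration only: kernel_neon_begin() and kernel_neon_end()
are replaced by stand-ins that merely count sections (in the kernel
they are declared in <asm/neon.h>), and xor_chunk(), xor_in_chunks()
and the chunk parameter are hypothetical names invented for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-ins (assumption): they only track nesting and count
 * completed sections so the sketch can be built outside the kernel.
 */
static unsigned int neon_sections;	/* completed begin/end pairs */
static int neon_held;			/* non-zero inside a section */

static void kernel_neon_begin(void)
{
	assert(!neon_held);		/* sections must not nest */
	neon_held = 1;
}

static void kernel_neon_end(void)
{
	assert(neon_held);
	neon_held = 0;
	neon_sections++;
}

/*
 * Hypothetical inner routine. In the kernel this would be the NEON
 * code proper; plain C is used here as a placeholder.
 */
static void xor_chunk(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/*
 * XOR len bytes of src into dst, at most chunk bytes per NEON section.
 * No NEON state is carried from one chunk to the next, so the point
 * between kernel_neon_end() and the next kernel_neon_begin() is a safe
 * place to be preempted.
 */
static void xor_in_chunks(uint8_t *dst, const uint8_t *src,
			  size_t len, size_t chunk)
{
	assert(chunk > 0);

	while (len > 0) {
		size_t n = len < chunk ? len : chunk;

		kernel_neon_begin();
		xor_chunk(dst, src, n);
		kernel_neon_end();	/* no NEON registers live here */

		dst += n;
		src += n;
		len -= n;
	}
}
```

A smaller chunk gives more frequent preemption points at the cost of
more begin/end calls; the right value depends on the workload and is
not something this sketch tries to determine.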

> I expect such NEON code would be rather at leaves of the call trees,
> so there should not be so many functions called with disabled preemption
> from within a NEON critical section, right?
>

The point is that the NEON code itself runs with preemption disabled,
and may take some time to complete, so you can trade some speed for
lower latency by releasing and re-acquiring the NEON unit more often
(but only in places where you can tolerate losing your NEON register
contents).

> Definitively I don't know the complexity of code that could benefit
> from NEON.
>

On Cortex-A15, I saw:
- 60% speedup in xor_blocks
- 400% speedup in RAID6
- 50% speedup in AES (CTR and XTS-encrypt modes, and potentially CCMP)

so there is definitely a case for NEON in the kernel.

[...]

>> +NEON assembler
>> +--------------
>> +NEON assembler is supported with no additional caveats as long as the rules
>> +above are followed.
>> +
>> +
>> +NEON code generated by GCC
>> +--------------------------
>> +The GCC option -ftree-vectorize (implied by -O3) tries to exploit implicit
>> +parallelism, and generates NEON code from ordinary C source code. This is fully
>> +supported as long as the rules above are followed.
>
> It's not clear to me the purpose of this paragraph.
>

The purpose of the document is to explain how kernel mode NEON is
intended to be used. In order to cover as many potential use cases as
possible, this paragraph is included to explain that you can use not
only explicit NEON, such as assembler or intrinsics, but also implicit
NEON, such as the code GCC generates in some cases.

> That -O3 should be not used anywhere in the kernel except for those units
> already built with -mfpu=neon -mfloat-abi=softfp?
>

People may have valid reasons for preferring -O3 in some places. They
should just be aware that combining -O3 with -mfpu=neon can result in
NEON code turning up anywhere, not just in the places where intrinsics
or (inline) assembler were used.
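
As an illustration of what 'implicit NEON' means, the function below
contains no intrinsics and no assembler. It is a hypothetical example
(xor_words() is not an existing kernel function), and whether GCC
actually vectorizes it depends on the compiler version and flags, but
it is the kind of loop that -ftree-vectorize may turn into NEON
instructions when the unit is built with -O3 -mfpu=neon.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Plain C only. If the compiler vectorizes this loop, the object code
 * will use NEON registers even though the source never mentions NEON.
 */
void xor_words(unsigned long *dst, const unsigned long *src, size_t n)
{
	assert(dst && src);	/* userspace sanity check for this sketch */

	for (size_t i = 0; i < n; i++)
		dst[i] ^= src[i];
}
```

If such a function lives in a unit built with those options, its
callers need to treat it like any other NEON code and call it only
between kernel_neon_begin() and kernel_neon_end().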

[...]

Regards,
Ard.