[PATCH v2] arm64/fpsimd: Add interface for kernel use of SVE and SME

Thu Nov 3 13:20:31 PDT 2022

On Thu, 3 Nov 2022 at 19:28, Mark Brown <broonie at kernel.org> wrote:
>
> We currently support in-kernel use of FPSIMD via the kernel_neon_begin()
> and kernel_neon_end() interface, but there is no corresponding interface
> for SVE or SME. Given that SVE hardware is now becoming widely available,
> there is interest in using these more modern floating point instruction
> sets for in-kernel applications. Let's add an interface which allows them
> to be selected in addition to FPSIMD.
>
> The sharing of registers and code means that kernel_neon_begin() already
> does most of the setup required. The only problem is that we are not
> configuring the vector length, so any SVE or SME code would just use
> whatever vector length is configured in the hardware, potentially leading to
> uneven performance on systems which support multiple vector lengths. Add
> a new kernel_fp_begin()/end() interface which allows the caller to flag if
> it will use SVE or SME and initialises the vector length if requested.
>
> We allow simultaneous specification of multiple extensions since it is
> possible that a user may wish to mix them in a single algorithm, and there
> is no cost to allowing this.
>
> Signed-off-by: Mark Brown <broonie at kernel.org>

The patch looks fine to me, but given that the purpose of this thread
is documentation, I will note that I would expect to see a substantial
benefit before enabling this.

Even though we have some pure SIMD algorithms in the tree, the real
value of kernel mode NEON is that it gives access to special crypto
instructions for which no scalar alternative exists. The performance
boost is easily 10x to 20x here, with the added benefit that the AES
instructions are constant time, whereas scalar AES is not, as it is
based on table lookups.
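
To illustrate the constant-time point, here is a minimal userspace
sketch (not kernel code) contrasting a data-dependent table lookup with
a lookup that touches every entry. The names table_lookup() and
ct_lookup() are hypothetical, and whether the second variant is really
constant time still depends on the compiler and the CPU.

```c
#include <stdint.h>

/*
 * Data-dependent lookup: the address accessed depends on the secret
 * index, so cache timing can leak information about it.
 */
static uint8_t table_lookup(const uint8_t table[256], uint8_t secret_idx)
{
	return table[secret_idx];
}

/*
 * Alternative that reads every entry and selects one with a mask, so
 * the memory access pattern does not depend on the secret index.
 */
static uint8_t ct_lookup(const uint8_t table[256], uint8_t secret_idx)
{
	uint8_t result = 0;

	for (unsigned int i = 0; i < 256; i++) {
		/* mask is 0xff when i == secret_idx, 0x00 otherwise */
		uint8_t mask = (uint8_t)(((uint32_t)(i ^ secret_idx) - 1) >> 24);

		result |= table[i] & mask;
	}
	return result;
}
```

The full-table scan costs 256 loads per byte, which is the kind of
overhead the dedicated AES instructions avoid.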

The AES and SHA instructions are also defined in the architecture for
SVE, and so kernel mode SVE would be needed to get access to those
instructions in the kernel.

*However*, this does not mean that SVE is going to be faster for those
algorithms. AES operates on 16-byte quantities, and the SHA algorithms
simply don't have enough parallelism to exploit. In other words, SHA
is not going to be faster, and AES is only going to be faster if the
CPU has multiple AES units running in parallel, which would be quite
costly in terms of gates.
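
As a rough illustration of the arithmetic, assuming SVE vector lengths
are multiples of 128 bits between 128 and 2048, the number of
independent 16-byte AES blocks held in one vector is simply the vector
length divided by 128. The helper name below is hypothetical.

```c
#define AES_BLOCK_BITS 128u

/* Number of independent AES blocks that fit in one vector of vl_bits. */
static unsigned int aes_blocks_per_vector(unsigned int vl_bits)
{
	return vl_bits / AES_BLOCK_BITS;
}
```

This is only an upper bound on the data-level parallelism per
instruction. It says nothing about throughput, which depends on how
many AES units the CPU actually implements.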

The RAID-5/6 and XOR checksumming could probably benefit from wide SVE
vectors, but I'd still like to see a real world use case that performs
better due to this.
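
The reason XOR is a plausible candidate is that its inner loop has no
dependency between iterations, so it scales with vector width in
principle. A minimal portable sketch follows; xor_block() is a
hypothetical name, not the kernel's existing XOR code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * XOR src into dst one 64-bit word at a time. Each iteration is
 * independent of the others, which is what makes the loop amenable
 * to wide vectors.
 */
static void xor_block(uint64_t *dst, const uint64_t *src, size_t len_words)
{
	for (size_t i = 0; i < len_words; i++)
		dst[i] ^= src[i];
}
```

Whether a wide SVE version of such a loop beats the existing NEON code
in practice is exactly what would need to be measured, since it is
likely to be limited by memory bandwidth rather than by ALU width.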

In summary, please don't consider this an invitation for another round
of check-the-box exercises where everything that SVE can do in
principle is implemented in the kernel. We'll need real world numbers.

> ---
>
> v2: Check for system_supports_sme() for setting the SME VL.
>
>  arch/arm64/include/asm/fpsimd.h |  7 +++++
>  arch/arm64/kernel/fpsimd.c      | 45 +++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
>
> diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
> index 6f86b7ab6c28..d4045fb73483 100644
> --- a/arch/arm64/include/asm/fpsimd.h
> +++ b/arch/arm64/include/asm/fpsimd.h
> @@ -44,6 +44,13 @@
>   */
>  #define SME_VQ_MAX     16
>
> +#define KERNEL_FP_FPSIMD       1
> +#define KERNEL_FP_SVE          2
> +#define KERNEL_FP_SME          4
> +
> +void kernel_fp_begin(unsigned int flags);
> +void kernel_fp_end(void);
> +
>  struct task_struct;
>
>  extern void fpsimd_save_state(struct user_fpsimd_state *state);
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 23834d96d1e7..01a79a8fe9f6 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -1858,6 +1858,51 @@ void kernel_neon_end(void)
>  }
>  EXPORT_SYMBOL(kernel_neon_end);
>
> +/**
> + * kernel_fp_begin(): obtain the CPU floating point registers for use
> + * by the calling context
> + *
> + * @flags: KERNEL_FP_ flags specifying which FP features will be used.
> + *
> + * The caller is responsible for ensuring that the requested floating
> + * point features are available on the current system.  Task context
> + * in the registers is saved back to memory as necessary.  If SVE or
> + * SME support is enabled then the maximum available vector length
> + * will be selected.
> + *
> + * A matching call to kernel_fp_end() must be made before returning from the
> + * calling context.
> + *
> + * The caller may freely use the floating point registers until
> + * kernel_fp_end() is called.
> + */
> +void kernel_fp_begin(unsigned int flags)
> +{
> +       kernel_neon_begin();
> +
> +       if (system_supports_sve() && (flags & KERNEL_FP_SVE))
> +               sve_set_vq(sve_vq_from_vl(sve_max_vl()) - 1);
> +
> +       if (system_supports_sme() && (flags & KERNEL_FP_SME))
> +               sme_set_vq(sve_vq_from_vl(sme_max_vl()) - 1);
> +}
> +EXPORT_SYMBOL(kernel_fp_begin);
> +
> +/**
> + * kernel_fp_end(): end kernel usage of the floating point registers
> + *
> + * Must be called from a context in which kernel_fp_begin() was previously
> + * called, with no call to kernel_fp_end() in the meantime.
> + *
> + * The caller must not use the FPSIMD registers after this function is called,
> + * unless kernel_fp_begin() is called again in the meantime.
> + */
> +void kernel_fp_end(void)
> +{
> +       kernel_neon_end();
> +}
> +EXPORT_SYMBOL(kernel_fp_end);
> +
>  #ifdef CONFIG_EFI
>
>  static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state);
> --
> 2.30.2
>
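
For reference, a caller of the proposed interface might look roughly
like the sketch below. The flag values are taken from the patch. The
stub_* variables, the stubbed versions of system_supports_sve(),
kernel_fp_begin() and kernel_fp_end(), and do_vector_work() are
userspace stand-ins invented so the example builds outside the kernel.
They do not reflect the real implementations.

```c
#include <stdbool.h>

/* Flag values as defined in the patch. */
#define KERNEL_FP_FPSIMD	1
#define KERNEL_FP_SVE		2
#define KERNEL_FP_SME		4

/* Userspace stand-ins (assumptions) replacing the kernel functions. */
static bool stub_has_sve = true;
static unsigned int stub_last_flags;
static int stub_depth;

static bool system_supports_sve(void)
{
	return stub_has_sve;
}

static void kernel_fp_begin(unsigned int flags)
{
	stub_last_flags = flags;
	stub_depth++;
}

static void kernel_fp_end(void)
{
	stub_depth--;
}

/* Hypothetical caller: returns true if the SVE path was selected. */
static bool do_vector_work(void)
{
	unsigned int flags = KERNEL_FP_FPSIMD;
	bool use_sve = system_supports_sve();

	/* The caller, not kernel_fp_begin(), decides which features to use. */
	if (use_sve)
		flags |= KERNEL_FP_SVE;

	kernel_fp_begin(flags);
	/* ... SVE or FPSIMD code would run here ... */
	kernel_fp_end();

	return use_sve;
}
```

A real in-kernel caller would still be subject to the existing
constraints on kernel_neon_begin(), since the new helper is a thin
wrapper around it.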