[PATCH v2 0/4] arm64: Run kernel mode NEON with preemption enabled

Eric Biggers ebiggers at kernel.org
Thu Nov 23 16:11:39 PST 2023


Hi Ard,

On Thu, Nov 23, 2023 at 06:44:34PM +0100, Ard Biesheuvel wrote:
> 
> Currently, kernel mode NEON (SIMD) support is implemented in a way that
> requires preemption to be disabled while the SIMD registers are live.
> The reason for this is that those registers are not in the set that is
> preserved/restored on exception entry/exit and context switch, as this
> would impact performance generally, even for workloads where kernel mode
> SIMD is not the bottleneck.
> 
> However, doing substantial work with preemption disabled is not great,
> as it affects scheduling latency, which is especially problematic for
> real-time use cases. So ideally, we should keep preemption enabled when
> we can, and find another way to ensure that this does not corrupt the
> NEON register state of in-kernel SIMD users.
> 
> This series implements a suggestion by Mark Rutland, and introduces a
> thread_info flag TIF_USING_KMODE_NEON, which indicates to the thread
> switch machinery that the task in question has live kernel mode SIMD
> state which needs to be preserved and restored. The space needed for
> this is allocated in thread_struct. (*)
> 
> Given that currently, we run kernel mode NEON with softirqs disabled (to
> avoid the need for preserving kernel mode NEON context belonging to task
> context while the SIMD unit is being used by code running in softirq
> context), just removing the preempt_disable/enable calls is not
> sufficient, and we also need to leave softirqs enabled. This means that
> we may need to preserve kernel mode NEON state not only on a context
> switch, but also when code running in softirq context takes ownership of
> the SIMD unit, but this is straightforward once we add the scratch
> space to thread_struct.
> 
> (*) We might decide to allocate this space (~512 bytes) dynamically, if
> the thread_struct memory footprint causes issues. However, we should
> also explore doing the same for the user space FPSIMD state, as kernel
> threads never return to user space and have no need for this allocation.
> 
> v2:
> - tweak some commit logs for clarity
> - integrate with the existing lazy restore logic
> - add Mark's R-b to patch #1
> 
> Cc: Marc Zyngier <maz at kernel.org>
> Cc: Will Deacon <will at kernel.org>
> Cc: Mark Rutland <mark.rutland at arm.com>
> Cc: Kees Cook <keescook at chromium.org>
> Cc: Catalin Marinas <catalin.marinas at arm.com>
> Cc: Mark Brown <broonie at kernel.org>
> Cc: Eric Biggers <ebiggers at google.com>
> 
> Ard Biesheuvel (4):
>   arm64: fpsimd: Drop unneeded 'busy' flag
>   arm64: fpsimd: Preserve/restore kernel mode NEON at context switch
>   arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD
>   arm64: crypto: Remove conditional yield logic
> 
>  arch/arm64/crypto/aes-glue.c         |  21 +--
>  arch/arm64/crypto/aes-modes.S        |   2 -
>  arch/arm64/crypto/sha1-ce-core.S     |   2 -
>  arch/arm64/crypto/sha1-ce-glue.c     |  19 +--
>  arch/arm64/crypto/sha2-ce-core.S     |   2 -
>  arch/arm64/crypto/sha2-ce-glue.c     |  19 +--
>  arch/arm64/crypto/sha3-ce-core.S     |   4 +-
>  arch/arm64/crypto/sha3-ce-glue.c     |  14 +-
>  arch/arm64/crypto/sha512-ce-core.S   |   2 -
>  arch/arm64/crypto/sha512-ce-glue.c   |  16 +-
>  arch/arm64/include/asm/assembler.h   |  29 ----
>  arch/arm64/include/asm/processor.h   |   3 +
>  arch/arm64/include/asm/simd.h        |  11 +-
>  arch/arm64/include/asm/thread_info.h |   1 +
>  arch/arm64/kernel/asm-offsets.c      |   4 -
>  arch/arm64/kernel/fpsimd.c           | 162 +++++++++++++-------
>  16 files changed, 138 insertions(+), 173 deletions(-)

Thanks for doing this!  This seems like a very nice improvement, as long as the
extra 528 bytes in thread_struct is not too much of an issue.

The arch/arm64/crypto/ changes seem to miss removing some of the places that do
the "yield every 4096 bytes" thing:

    git grep '\<4096\|SZ_4K\>' arch/arm64/crypto/
    arch/arm64/crypto/aes-ce-ccm-glue.c:            n = min_t(u32, n, SZ_4K); /* yield NEON at least every 4k */
    arch/arm64/crypto/aes-ce-ccm-glue.c:            if (len / SZ_4K > (len - n) / SZ_4K) {
    arch/arm64/crypto/chacha-neon-glue.c:           unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
    arch/arm64/crypto/crct10dif-ce-glue.c:                  if (chunk > SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE)
    arch/arm64/crypto/crct10dif-ce-glue.c:                          chunk = SZ_4K;
    arch/arm64/crypto/crct10dif-ce-glue.c:                  if (chunk > SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE)
    arch/arm64/crypto/crct10dif-ce-glue.c:                          chunk = SZ_4K;
    arch/arm64/crypto/nhpoly1305-neon-glue.c:               unsigned int n = min_t(unsigned int, srclen, SZ_4K);
    arch/arm64/crypto/poly1305-glue.c:                              unsigned int todo = min_t(unsigned int, len, SZ_4K);
    arch/arm64/crypto/polyval-ce-glue.c:            nblocks = min(srclen, 4096U) / POLYVAL_BLOCK_SIZE;

Those can all be cleaned up too, right?

Please make sure to also update the comments above the assembly functions as
needed.  For example, in the following, the return type changed to void:

        /*
         * int __sha1_ce_transform(struct sha1_ce_state *sst, u8 const *src,
         *                         int blocks)
         */
        SYM_FUNC_START(__sha1_ce_transform)

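Presumably the fixed-up comment for that example would just mirror the new
prototype, i.e.:

        /*
         * void __sha1_ce_transform(struct sha1_ce_state *sst, u8 const *src,
         *                          int blocks)
         */
        SYM_FUNC_START(__sha1_ce_transform)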
Can you also Cc linux-crypto on future versions of this patch series?

Thanks!

- Eric


