[PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps

Mon Feb 8 12:26:10 EST 2021

On Mon, Feb 01, 2021 at 12:28:59PM +0000, Mark Brown wrote:
> This patch series aims to improve the performance of handling SVE access
> traps, earlier versions were originally written by Julien Gral but based
> on discussions on previous versions the patches have been substantially
> reworked to use a different approach.  The patches are now different
> enough that I set myself as the author, hopefully that's OK for Julien.
> 
> I've marked this as RFC since it's not quite ready yet but I'd really
> like feedback on the overall approach, it's a big change in
> implementation.  It needs at least one more pass for polish and while
> it's holding up in my testing thus far I've not done as much as I'd like
> yet.
> 
> Per the syscall ABI, SVE registers will be unknown after a syscall.  In
> practice, the kernel will disable SVE and the registers will be zeroed
> (except the first 128 bits of each vector) on the next SVE instruction.
> Currently we do this by saving the FPSIMD state to memory, converting to
> the matching SVE state and then reloading the registers on return to
> userspace.  This requires a lot of memory accesses that we shouldn't
> need, improve this by reworking the SVE state tracking so we track if we
> should trap on executing SVE instructions separately to if we need to
> save the full register state.  This allows us to avoid tracking the full
> SVE state until we need to return to userspace and to convert directly
> in registers in the common case where the FPSIMD state is still in
> registers then.
> 
> As with current mainline we disable SVE on every syscall.  This may not
> be ideal for applications that mix SVE and syscall usage, strategies
> such as SH's fpu_counter may perform better but we need to assess the
> performance on a wider range of systems than are currently available
> before implementing anything.
> 
> It is also possible to optimize the case when the SVE vector length
> is 128-bit (ie the same size as the FPSIMD vectors).  This could be
> explored in the future, it becomes a lot easier to do with this
> implementation.
> 
> v7:
>  - A few minor cosmetic updates and one bugfix for
>    fpsimd_update_current_state().

I see the following delta in this function:

@@ -1272,7 +1272,8 @@ void fpsimd_update_current_state(struct user_fpsimd_state const *state)
 	get_cpu_fpsimd_context();
 
 	current->thread.uw.fpsimd_state = *state;
-	clear_thread_flag(TIF_SVE_FULL_REGS);
+	if (test_thread_flag(TIF_SVE_FULL_REGS))
+		fpsimd_to_sve(current);
 
 	task_fpsimd_load();
 	fpsimd_bind_task_to_cpu();


Can you elaborate on that?

> v6:
>  - Substantially rework the patch so that TIF_SVE is now replaced by
>    two flags TIF_SVE_EXEC and TIF_SVE_FULL_REGS.
>  - Return to disabling SVE after every syscall as for current
>    mainine rather than leaving it enabled unless reset via ptrace.

Some general thoughts before I get into the detail:

I think the implementation flow in this series is roughly:

 a) Split TIF_SVE into two flags, but keep the flags in lockstep, and
    don't pretend to handle cases where they are unequal (no functional
    change).

 b) Add handling for cases where the flags are unequal (the only
    meaningful case being TIF_SVE_EXEC && !TIF_SVE_FULL_REGS) (purely
    dead code addition; no functional change).

 c) Add code that causes / resolves inequality of the flags in order to
    achieve the goal of the series (functional change).

Does that sound about right?  Patch 2 just does (c), broadly speaking,
I couldn't convince myself whether patch 1 is introducing functional
changes or not.

The KVM parts of (b) could maybe benefit from their own commit message.
If (b) were all in a separate patch by itself, it would be
straightfoward to split the host and KVM changes up.

Maybe splitting patch 1 into two would help clarify the situation.


Second, TIF_SVE_EXEC is advertised as being a pure indication of whether
the task is to be allowed to use SVE in userspace.  I find that this
jars a little with the fact that we leave TIF_SVE_EXEC set in the
!FULL_REGS case -- where we cannot enter userspace without doing some
additional work.  This is not necessarily a problem, but it was one
extra thing to get my head around.

There seems to be a preexisting convention that a set thread flag
implies a "dirtier" condition than when the flag is clear, though
obviously it works either way.  TIF_SVE_FULL_REGS here is really a work-
not-needed flag (whereas TIF_FOREIGN_FPSTATE and various others are
work-needed flags).  This also is not necessarily a problem, but I find
it a bit confusing.  Inverting the sense of TIF_SVE_FULL_REGS, or
a more explicit name such as TIF_SVE_FULL_REGS_VALID might make it
clearer that there is work to do in the "not valid" case and we cannot
blindly enter userspace or save the task state without taking any notice
of the flag.

These are not intended as criticisms -- it's hard to slice up the states
we need into nice orthogonal concepts, and I'm still waiting for this
refactored view to settle properly in my head.


Ultimately, we want a decision point somewhere about whether to fall
back to FPSIMD-only mode for a task or not, even if the decision is
trivial for now.  Where do you think that sits?  I'd generally assumed
that it would go in fpsimd_save() since that's where we take the hit for
saving (and later restoring) the state.  That function isn't preemptible
though, so it might also make sense to make the decision elsewhere if
there are suitable places to do it.

I say this because it is important for this series to make a foundation
that such a decision can be slotted naturally into -- that's part of the
motiviation of making these changes in the first place (IIUC, anyway).

[...]

Cheers
---Dave