[PATCH v6] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state

Dave Martin Dave.Martin at arm.com
Thu May 30 07:03:42 PDT 2024


Hi Mark,

On Wed, May 29, 2024 at 08:46:23PM +0100, Mark Brown wrote:
> When we are in a syscall we take the opportunity to discard the SVE state,
> saving only the FPSIMD subset of the register state. When we reload the
> state from memory we reenable SVE access traps, stopping tracking SVE until
> the task takes another SVE access trap. This means that for a task which is
> actively using SVE many blocking system calls will have the additional
> overhead of a SVE access trap.

Playing devil's advocate here: doesn't a blocking syscall already imply
a high overhead (at least in terms of latency for the thread concerned)?

i.e., does letting TIF_SVE linger across some blocking syscalls make a
meaningful difference in some known use case?


(For non-blocking syscalls the argument for allowing TIF_SVE to linger
seems a lot stronger.)

> As SVE deployment is progressing we are seeing much wider use of the SVE
> instruction set, including performance optimised implementations of
> operations like memset() and memcpy(), which mean that even tasks which are
> not obviously floating point based can end up with substantial SVE usage.
> 
> It does not, however, make sense to just unconditionally use the full SVE
> register state all the time since it is larger than the FPSIMD register
> state so there is overhead saving and restoring it on context switch and
> our requirement to flush the register state not shared with FPSIMD on
> syscall also creates a noticeable overhead on system call.
> 
> I did some instrumentation which counted the number of SVE access traps
> and the number of times we loaded FPSIMD only register state for each task.
> Testing with Debian Bookworm this showed that during boot the overwhelming
> majority of tasks triggered another SVE access trap more than 50% of the
> time after loading FPSIMD only state with a substantial number near 100%,
> though some programs had a very small number of SVE accesses most likely
> from startup. There were few tasks in the range 5-45%, most tasks either
> used SVE frequently or used it only a tiny proportion of times. As expected
> older distributions which do not have the SVE performance work available
> showed no SVE usage in general applications.
> 
> This indicates that there should be some useful benefit from reducing the
> number of SVE access traps for blocking system calls like we did for non
> blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE enabled
> on syscall if we don't context switch"). Let's do this with a timeout, when
> we take a SVE access trap record a jiffies after which we'll reenable SVE
> traps then check this whenever we load a FPSIMD only floating point state
> from memory. If the time has passed then we reenable traps, otherwise we
> leave traps disabled and flush the non-shared register state like we would
> on trap.
> 
> The timeout is currently set to a second, I pulled this number out of thin
> air so there is doubtless some room for tuning. This means that for a
> task which is actively using SVE the number of SVE access traps will be
> substantially reduced but applications which use SVE only very
> infrequently will avoid the overheads associated with tracking SVE state
> after a second. The extra cost from additional tracking of SVE state
> only occurs when a task is preempted so short running tasks should be
> minimally affected.

Could your instrumentation be extended to build a histogram of the delay
between successive SVE traps per task?

There's a difference here between a task that takes a lot of traps in a
burst (perhaps due to startup or a specific library call), versus a task
that uses SVE sporadically for all time.
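
To make the histogram idea concrete, here is a rough user-space sketch of
the sort of per-task bookkeeping I have in mind.  It is only an
illustration: struct trap_hist, trap_hist_bucket() and trap_hist_record()
are made-up names, the log2 bucketing and the bucket count are arbitrary
choices, and "now" stands in for a jiffies reading taken at each trap.

```c
/* Hypothetical: number of log2 buckets; the last one collects everything larger. */
#define TRAP_HIST_BUCKETS 16

/* Hypothetical per-task record of gaps between successive SVE access traps. */
struct trap_hist {
	unsigned long last_trap;	/* "jiffies" at the previous trap */
	int have_last;			/* 0 until the first trap has been seen */
	unsigned long count[TRAP_HIST_BUCKETS];
};

/*
 * Map a gap to a bucket: gaps of 0 or 1 go to bucket 0, otherwise
 * bucket i covers [2^i, 2^(i+1)), capped at the last bucket.
 */
static unsigned int trap_hist_bucket(unsigned long delta)
{
	unsigned int b = 0;

	while (delta > 1 && b < TRAP_HIST_BUCKETS - 1) {
		delta >>= 1;
		b++;
	}
	return b;
}

/* Call once per trap; unsigned subtraction keeps the gap correct across wrap. */
static void trap_hist_record(struct trap_hist *h, unsigned long now)
{
	if (h->have_last)
		h->count[trap_hist_bucket(now - h->last_trap)]++;
	h->last_trap = now;
	h->have_last = 1;
}
```

A bursty task would pile up counts in the low buckets and a sporadic user
in the high ones, which is the distinction that matters for picking the
timeout.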

I wonder whether the sweet spot for the timeout may be quite a lot
shorter than a second.  Still, once we have something we can tune, we
can always tune it later as you suggest.
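
For experimenting with shorter values, a toy user-space model of the
timeout logic as I read it from the patch might look like the below.
This is a sketch under assumptions, not the kernel code: MODEL_HZ,
struct model_task and the model_*() helpers are invented for
illustration, and the timeout is passed in as a parameter rather than
being fixed at HZ.

```c
#include <stdbool.h>

/* Assumed stand-in for the kernel's HZ; the real value is config dependent. */
#define MODEL_HZ 250UL

/* Wrap-safe "a is later than b", in the style of the kernel's time_after(). */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical slice of per-task state: TIF_SVE plus the new timeout field. */
struct model_task {
	bool tif_sve;
	unsigned long sve_timeout;
};

/* Models the SVE access trap: enable SVE and arm the timeout. */
static void model_sve_trap(struct model_task *t, unsigned long now,
			   unsigned long timeout_jiffies)
{
	t->tif_sve = true;
	t->sve_timeout = now + timeout_jiffies;
}

/*
 * Models loading FPSIMD-only state.  Returns true if SVE stays enabled
 * (no trap on the next SVE use), false if traps are re-enabled.
 */
static bool model_load_fpsimd(struct model_task *t, unsigned long now)
{
	if (!t->tif_sve)
		return false;
	if (model_time_after(now, t->sve_timeout)) {
		t->tif_sve = false;	/* timeout expired: stop tracking SVE */
		return false;
	}
	return true;
}
```

Feeding recorded trap and reload times through something like this with
different timeout_jiffies values would give a cheap first estimate of
how many traps each setting saves.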

> 
> There should be no functional change resulting from this, it is purely a
> performance optimisation.
> 
> Signed-off-by: Mark Brown <broonie at kernel.org>
> ---
> Changes in v6:
> - Rebase onto v6.10-rc1.
> - Link to v5: https://lore.kernel.org/r/20240405-arm64-sve-trap-mitigation-v5-1-126fe2515ef1@kernel.org
> 
> Changes in v5:
> - Rebase onto v6.9-rc1.
> - Use a timeout rather than number of state loads to decide when to
>   reenable traps.
> - Link to v4: https://lore.kernel.org/r/20240122-arm64-sve-trap-mitigation-v4-1-54e0d78a3ae9@kernel.org
> 
> Changes in v4:
> - Rebase onto v6.8-rc1.
> - Link to v3: https://lore.kernel.org/r/20231113-arm64-sve-trap-mitigation-v3-1-4779c9382483@kernel.org
> 
> Changes in v3:
> - Rebase onto v6.7-rc1.
> - Link to v2: https://lore.kernel.org/r/20230913-arm64-sve-trap-mitigation-v2-1-1bdeff382171@kernel.org
> 
> Changes in v2:
> - Rebase onto v6.6-rc1.
> - Link to v1: https://lore.kernel.org/r/20230807-arm64-sve-trap-mitigation-v1-1-d92eed1d2855@kernel.org
> ---
>  arch/arm64/include/asm/processor.h |  1 +
>  arch/arm64/kernel/fpsimd.c         | 42 ++++++++++++++++++++++++++++++++------
>  2 files changed, 37 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> index f77371232d8c..7a6ed0551291 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -164,6 +164,7 @@ struct thread_struct {
>  	unsigned int		fpsimd_cpu;
>  	void			*sve_state;	/* SVE registers, if any */
>  	void			*sme_state;	/* ZA and ZT state, if any */
> +	unsigned long		sve_timeout;    /* jiffies to drop TIF_SVE */
>  	unsigned int		vl[ARM64_VEC_MAX];	/* vector length */
>  	unsigned int		vl_onexec[ARM64_VEC_MAX]; /* vl after next exec */
>  	unsigned long		fault_address;	/* fault info */
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 82e8a6017382..4741e4fb612a 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -354,6 +354,7 @@ static void task_fpsimd_load(void)
>  {
>  	bool restore_sve_regs = false;
>  	bool restore_ffr;
> +	unsigned long sve_vq_minus_one;
>  
>  	WARN_ON(!system_supports_fpsimd());
>  	WARN_ON(preemptible());
> @@ -365,18 +366,12 @@ static void task_fpsimd_load(void)
>  	if (system_supports_sve() || system_supports_sme()) {
>  		switch (current->thread.fp_type) {
>  		case FP_STATE_FPSIMD:
> -			/* Stop tracking SVE for this task until next use. */
> -			if (test_and_clear_thread_flag(TIF_SVE))
> -				sve_user_disable();
>  			break;
>  		case FP_STATE_SVE:
>  			if (!thread_sm_enabled(&current->thread) &&
>  			    !WARN_ON_ONCE(!test_and_set_thread_flag(TIF_SVE)))
>  				sve_user_enable();
>  
> -			if (test_thread_flag(TIF_SVE))
> -				sve_set_vq(sve_vq_from_vl(task_get_sve_vl(current)) - 1);
> -
>  			restore_sve_regs = true;
>  			restore_ffr = true;
>  			break;
> @@ -395,6 +390,15 @@ static void task_fpsimd_load(void)
>  		}
>  	}
>  
> +	/*
> +	 * If SVE has been enabled we may keep it enabled even if
> +	 * loading only FPSIMD state, so always set the VL.
> +	 */
> +	if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
> +		sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;
> +		sve_set_vq(sve_vq_minus_one);
> +	}
> +
>  	/* Restore SME, override SVE register configuration if needed */
>  	if (system_supports_sme()) {
>  		unsigned long sme_vl = task_get_sme_vl(current);
> @@ -421,6 +425,25 @@ static void task_fpsimd_load(void)
>  	} else {
>  		WARN_ON_ONCE(current->thread.fp_type != FP_STATE_FPSIMD);
>  		fpsimd_load_state(&current->thread.uw.fpsimd_state);
> +
> +		/*
> +		 * If the task had been using SVE we keep it enabled
> +		 * when loading FPSIMD only state for a period to
> +		 * minimise overhead for tasks actively using SVE,
> +		 * disabling it periodically to ensure that tasks that
> +		 * use SVE intermittently do eventually avoid the
> +		 * overhead of carrying SVE state.  The timeout is
> +		 * initialised when we take a SVE trap in
> +		 * do_sve_acc().
> +		 */
> +		if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
> +			if (time_after(jiffies, current->thread.sve_timeout)) {
> +				clear_thread_flag(TIF_SVE);
> +				sve_user_disable();
> +			} else {
> +				sve_flush_live(true, sve_vq_minus_one);

Didn't we already flush Zn[max:128] as a side-effect of loading the
V-regs in fpsimd_load_state() above?

Also, unless I'm missing something, prior to this patch we could just
fall through this code with TIF_SVE still set, suggesting that either
this flush is not needed for some reason, or it is shadowed by another
flush done somewhere else, or a flush is currently needed but missing.
Am I just getting myself confused here?

(Or, do the deletions from the switch in the earlier hunk cancel this
out?)

> +			}
> +		}
>  	}
>  }
>  
> @@ -1397,6 +1420,13 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
>  
>  	get_cpu_fpsimd_context();
>  
> +	/*
> +	 * We will keep SVE enabled when loading FPSIMD only state for
> +	 * the next second to minimise traps when userspace is
> +	 * actively using SVE.
> +	 */
> +	current->thread.sve_timeout = jiffies + HZ;
> +
>  	if (test_and_set_thread_flag(TIF_SVE))
>  		WARN_ON(1); /* SVE access shouldn't have trapped */
>  
> 
> ---
> base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
> change-id: 20230807-arm64-sve-trap-mitigation-2e7e2663c849
> 
> Best regards,
> -- 
> Mark Brown <broonie at kernel.org>

[...]

Cheers
---Dave


