[PATCH v6] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state

Mark Brown broonie at kernel.org
Thu May 30 08:40:47 PDT 2024


On Thu, May 30, 2024 at 03:03:42PM +0100, Dave Martin wrote:
> On Wed, May 29, 2024 at 08:46:23PM +0100, Mark Brown wrote:

> > When we are in a syscall we take the opportunity to discard the SVE state,
> > saving only the FPSIMD subset of the register state. When we reload the
> > state from memory we reenable SVE access traps, stopping tracking SVE until
> > the task takes another SVE access trap. This means that for a task which is
> > actively using SVE many blocking system calls will have the additional
> > overhead of a SVE access trap.

> Playing devil's advocate here: doesn't a blocking syscall already imply
> a high overhead (at least in terms of latency for the thread concerned)?

> i.e., does letting TIF_SVE linger across some blocking syscalls make a
> meaningful difference in some known use case?

I don't have any solid examples since I haven't done physical benchmarking
yet (the only hardware I have access to is Graviton, and I need to figure
out deploying kernels there for testing).  I am always slightly nervous
about benchmarking only on big cores, though for something like this they
may well see more benefit anyway.

> Could your instrumentation be extended to build a histogram of the delay
> between successive SVE traps per task?

You could build that instrumentation; what I had was extremely crude,
though.
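
Purely as an illustration of the sort of thing (none of this exists
today: record_sve_trap_delay(), last_sve_trap and sve_trap_hist are all
made up fields and names), a per task log2 histogram might look like:

	#define SVE_TRAP_HIST_SIZE	16	/* hypothetical bucket count */

	/*
	 * Hypothetical instrumentation, called from do_sve_acc(): bucket
	 * the delay in jiffies between successive SVE access traps for
	 * this task.  Needs linux/log2.h plus the two made up fields in
	 * thread_struct.
	 */
	static void record_sve_trap_delay(struct task_struct *tsk)
	{
		unsigned long delta = jiffies - tsk->thread.last_sve_trap;
		unsigned int bucket = min(ilog2(delta + 1),
					  SVE_TRAP_HIST_SIZE - 1);

		tsk->thread.sve_trap_hist[bucket]++;
		tsk->thread.last_sve_trap = jiffies;
	}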

> There's a difference here between a task that takes a lot of traps in a
> burst (perhaps due to startup or a specific library call), versus a task
> that uses SVE sporadically for all time.

The pattern I was seeing was mostly bimodal: either a very small number
of SVE uses during startup and then none for the rest of the runtime, or
SVE used on comfortably over 50% of the restores, so I didn't look super
hard.  I'd expect some workloads to land more in the middle, though,
where a non-SVE application uses an SVE enabled library (e.g., a JIT
that doesn't generate SVE code itself but occasionally calls out to a C
library built with SVE).

> I wonder whether the sweet spot for the timeout may be quite a lot
> shorter than a second.  Still, once we have something we can tune, we
> can always tune it later as you suggest.

My instinct is that a second is probably on the high end, yeah.  Given
that I'm just pulling numbers out of thin air I went high to minimise
the impact on tasks that are actively using SVE for a long time,
guessing that short running tasks are unlikely to be scheduled very
often anyway and only pay the cost when preempted.
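
To make that concrete, the arming side in the access trap handler is
just a jiffies comparison (sketch only, using the sve_timeout field this
patch adds; the constant is the tunable we're discussing):

	/*
	 * (Re)arm the timeout whenever the task takes an SVE access trap;
	 * one second is HZ jiffies, a shorter window just scales this down.
	 */
	current->thread.sve_timeout = jiffies + HZ;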

> > +			if (time_after(jiffies, current->thread.sve_timeout)) {
> > +				clear_thread_flag(TIF_SVE);
> > +				sve_user_disable();
> > +			} else {
> > +				sve_flush_live(true, sve_vq_minus_one);

> Didn't we already flush Zn[max:128] as a side-effect of loading the
> V-regs in fpsimd_load_state() above?

Looking at the pseudocode for V[], that does appear to be true for the
Z registers since we are now setting the vector length, but we do still
need to worry about the P and FFR registers so there still needs to be
some flush.  Unless we're both missing something there's room for an
optimisation here; it won't make a difference for 128 bit VLs, since we
skip the Z register flush there anyway, but it would help with larger
VLs.
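
Concretely, the else branch in the hunk above could shrink to something
like this, with sve_flush_live_pffr() being a made up name for a helper
that would flush only P0-P15 and FFR:

	} else {
		/*
		 * fpsimd_load_state() above already zeroed Zn[max:128] with
		 * the task's VL in force, so only the predicate state still
		 * needs flushing.  sve_flush_live_pffr() does not exist
		 * today; sve_flush_live() also flushes the Z registers for
		 * VLs above 128 bits.
		 */
		sve_flush_live_pffr();
	}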

> Also, unless I'm missing something, prior to this patch we could just
> fall through this code with TIF_SVE still set, suggesting that either
> this flush is not needed for some reason, or it is shadowed by another
> flush done somewhere else, or a flush is currenly needed but missing.
> Am I just getting myself confused here?

With the current code we disable SVE, so the contents of the state only
accessible with SVE become immaterial; we deal with it when we take an
access trap.

With this patch we leave SVE enabled, so we do need to ensure that the
visible SVE state is flushed.  As you point out, we no longer need to
flush the Z registers even for larger VLs: SVE is always enabled for EL1,
and with this patch we always set the vector length if the task has
TIF_SVE.  Setting the VL means that the constrained unpredictable choice
between zeroing the maximum or the currently visible upper bits of the Z
registers will always zero all the bits that EL0 could access.  We do,
however, still need to flush the predicate registers.