[PATCH v2 7/7] arm64/sve: Don't zero non-FPSIMD register state on syscall by default

Tue Jul 19 12:35:46 PDT 2022

On Tue, Jul 19, 2022 at 06:35:37PM +0100, Catalin Marinas wrote:
> On Mon, Jun 20, 2022 at 01:41:58PM +0100, Mark Brown wrote:

> > The documented syscall ABI specifies that the SVE state not shared with
> > FPSIMD is undefined after a syscall. Currently we implement this by
> > always flushing this register state to zero, ensuring consistent
> > behaviour but introducing some overhead in the case where we can return
> > directly to userspace without otherwise needing to update the register
> > state. Take advantage of the flexibility offered by the documented ABI
> > and instead leave the SVE registers untouched in the case where can
> > return directly to userspace.

> Do you have some rough numbers to quantify the gain? I suspect the
> vector length doesn't matter much.

I got some benchmarking done with fp-pidbench (which like I think I say
somewhere in the series is a nonsense benchmark but still) which showed
an IIRC approximately 1% or so win from a similar change that was fairly
consistent over a few implementations.  Unfortunately I don't yet have
access to systems that would allow me to benchmark directly and
nobody's got back with numbers for this specific code, the numbers were
with some earlier proof of concept work.

There is some vector length dependency when we move over 128 bits, at
128 bits there's no state in the Z registers that isn't shared with the
V registers so we can and do already skip handling them entirely which
makes the overhead of zeroing noticably lower, beyond that the cost
should be stable.  The additional cost for the Z registers was *about*
the same as that of just doing the P registers IIRC, that does track
with doing an additional 32 floating point operations.

128 bit systems are in general a bit of a special case for optimisation
due to the reduced extra state.

> Where does the zeroing happen now? IIRC it's only done on a subsequent
> trap to SVE and that's a lot more expensive (unless the code has changed
> since last time I looked).

At the minute we drop SVE permission for userspace on syscall entry, the
actual zeroing for the task happens when it next takes a SVE trap prior
to reenabling SVE for EL0.

> So if it's the actual subsequent trap that adds the overhead, maybe
> zeroing the regs while leaving TIF_SVE on won't be that bad.

Yeah, it's definitely *much* less exciting than the win from avoiding
the SVE trap.

> > The sysctl is disabled by default since it is anticipated that the risk
> > of disruption to userspace is low. As well as being within the
> > documented ABI this new behaviour mirrors the standard function call ABI
> > for SVE in the AAPCS which should mean that compiler generated code is
> > unlikely to rely on the current behaviour, the main risk is from hand
> > coded assembly which directly invokes syscalls. The new behaviour is
> > also what is currently implemented by qemu user mode emulation.

> IIRC both Will and Mark R commented in the past that they'd like the
> current de-facto ABI to become the official one. I'll let them comment.

That would be good.  I've not heard anything from Will either directly
or indirectly.  Mark R has indicated privately directly to me that he
originally pushed for the currently implemented behaviour and prefers
it.  Marc Zyngier has previously noted publicly the current behaviour
being a consideration in the context of discusion of optimisation ideas
like this one, I was a bit surprised that he commented on an earlier
patch in the series but not this one.  You have previously pushed back
on an attempt to document the current ABI, that was the main motivation
for writing this patch.  

I don't have a particular opinion either way but I do feel strongly that
we should either take advantage of the currently documented ABI or
document our actual ABI, right now we've got the worst of both worlds so
we should make a decision and pick one of those options.

> > -	if (test_thread_flag(TIF_SVE)) {
> > +	if (sve_syscall_regs_clear && test_thread_flag(TIF_SVE)) {
> >  		unsigned int sve_vq_minus_one;

> >  		sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;

> If we leave TIF_SVE on, does it mean that we incur an overhead on
> context switching? E.g. something like hackbench with lots of syscalls
> communicating between threads would unnecessarily context switch the SVE

It is true that in the context switch case if we zero on syscall entry
then we end up zeroing and then discarding the SVE state but in the
context of the context switch and a future SVE trap that should be
tolerable, and if the task doesn't use SVE between switches in syscalls
then it'll not have SVE enabled be impacted either way.

> state. Maybe there's something handling this but IIUC fpsimd_save()
> seems to only check TIF_SVE.

As of patch 6 in this series fpsimd_save() checks both TIF_SVE and
in_syscall() when using FP_STATE_TASK so we should only save the FPSIMD
state if we're in a syscall, in that patch we remove the clearing of
TIF_SVE on syscall and update fpsimd_save() to also check in_syscall()
when deciding what to save.  Prior to that we're keeping the behaviour
the same so fpsimd_sve() is only checking TIF_SVE, it's these last two
patches that intend to introduce behaviour changes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20220719/4caaee04/attachment.sig>