[PATCH v2 7/7] arm64/sve: Don't zero non-FPSIMD register state on syscall by default

Will Deacon will at kernel.org
Wed Jul 20 02:20:23 PDT 2022


On Tue, Jul 19, 2022 at 08:35:46PM +0100, Mark Brown wrote:
> On Tue, Jul 19, 2022 at 06:35:37PM +0100, Catalin Marinas wrote:
> > On Mon, Jun 20, 2022 at 01:41:58PM +0100, Mark Brown wrote:
> 
> > > The documented syscall ABI specifies that the SVE state not shared with
> > > FPSIMD is undefined after a syscall. Currently we implement this by
> > > always flushing this register state to zero, ensuring consistent
> > > behaviour but introducing some overhead in the case where we can return
> > > directly to userspace without otherwise needing to update the register
> > > state. Take advantage of the flexibility offered by the documented ABI
> > > and instead leave the SVE registers untouched in the case where can
> > > return directly to userspace.
> 
> > Do you have some rough numbers to quantify the gain? I suspect the
> > vector length doesn't matter much.
> 
> I got some benchmarking done with fp-pidbench (which like I think I say
> somewhere in the series is a nonsense benchmark but still) which showed
> an IIRC approximately 1% or so win from a similar change that was fairly
> consistent over a few implementations.  Unfortunately I don't yet have
> access to systems that would allow me to benchmark directly and
> nobody's got back with numbers for this specific code, the numbers were
> with some earlier proof of concept work.
> 
> There is some vector length dependency when we move over 128 bits, at
> 128 bits there's no state in the Z registers that isn't shared with the
> V registers so we can and do already skip handling them entirely which
> makes the overhead of zeroing noticably lower, beyond that the cost
> should be stable.  The additional cost for the Z registers was *about*
> the same as that of just doing the P registers IIRC, that does track
> with doing an additional 32 floating point operations.
> 
> 128 bit systems are in general a bit of a special case for optimisation
> due to the reduced extra state.
> 
> > Where does the zeroing happen now? IIRC it's only done on a subsequent
> > trap to SVE and that's a lot more expensive (unless the code has changed
> > since last time I looked).
> 
> At the minute we drop SVE permission for userspace on syscall entry, the
> actual zeroing for the task happens when it next takes a SVE trap prior
> to reenabling SVE for EL0.
> 
> > So if it's the actual subsequent trap that adds the overhead, maybe
> > zeroing the regs while leaving TIF_SVE on won't be that bad.
> 
> Yeah, it's definitely *much* less exciting than the win from avoiding
> the SVE trap.
> 
> > > The sysctl is disabled by default since it is anticipated that the risk
> > > of disruption to userspace is low. As well as being within the
> > > documented ABI this new behaviour mirrors the standard function call ABI
> > > for SVE in the AAPCS which should mean that compiler generated code is
> > > unlikely to rely on the current behaviour, the main risk is from hand
> > > coded assembly which directly invokes syscalls. The new behaviour is
> > > also what is currently implemented by qemu user mode emulation.
> 
> > IIRC both Will and Mark R commented in the past that they'd like the
> > current de-facto ABI to become the official one. I'll let them comment.
> 
> That would be good.  I've not heard anything from Will either directly
> or indirectly.  Mark R has indicated privately directly to me that he
> originally pushed for the currently implemented behaviour and prefers
> it.  Marc Zyngier has previously noted publicly the current behaviour
> being a consideration in the context of discusion of optimisation ideas
> like this one, I was a bit surprised that he commented on an earlier
> patch in the series but not this one.  You have previously pushed back
> on an attempt to document the current ABI, that was the main motivation
> for writing this patch.

I remember discussing this somewhere at some point with somebody, but that's
not a tonne of use!

My opinion is that the observable behaviour is what matters. So if we
documented that the register state is "undefined" but it is in fact always
zero, then we should fix the documentation.

Does that help?

Will



More information about the linux-arm-kernel mailing list