[PATCH v11 06/40] arm64/sme: Provide ABI documentation for SME

Mark Brown broonie at kernel.org
Mon Feb 14 11:40:52 PST 2022


On Mon, Feb 14, 2022 at 06:19:58PM +0000, Catalin Marinas wrote:
> On Fri, Feb 11, 2022 at 06:13:58PM +0000, Mark Brown wrote:

> > We could preserve PSTATE.SM, though since all the other register state
> > for streaming mode is shared with SVE I would expect that we should be
> > applying the SVE discard rules to it and there is therefore no other
> > state that should be retained.

> So when clearing PSTATE.SM, the streaming SVE regs become unknown (well,
> the wording is a bit more verbose). I think this fits well with the
> proposal to drop the streaming SVE state entirely on syscalls.

They're preserved or zeroed, yes.

> The ZA state I think is not affected by the PSTATE.SM change (early
> internal SME specs were listing this as unknown after SM clearing but I
> can't find it in the latest spec). However, after the syscall, the user
> won't be able to execute SME instruction until turning on PSTATE.SM
> again.

Yes, ZA is preserved unless PSTATE.ZA is disabled.  There are some
instructions that can be used to interact with it outside of streaming
mode, a subset of the instructions for loading and storing values in ZA.

> Would the libc wrappers preserve PSTATE.SM? What I find a bit confusing
> is that we only partially preserve some state while in streaming mode -
> the ZA registers but not the SVE ones.

I would expect that libc wrappers would expect to be called with
streaming mode already disabled - that's what default functions in the
PCS expect.  Without FA64 enabled a huge proportion of FPSIMD
instructions and some SVE instructions become undefined in streaming
mode, so standard code could easily generate traps if it uses those
instructions for anything.  I wouldn't expect that libc would explicitly
disable SME itself in standard configurations.

>                                        Is the user more likely to turn
> PSTATE.SM on for ZA processing or for SVE? If the former, we don't want
> to unnecessarily save/restore some SVE state that the user doesn't care

It's expected that any active work with ZA will require enabling
streaming mode: you can't do any actual computation with it without
doing so, and most of the work with ZA will involve using the streaming
mode SVE registers as part of the computation (eg, collecting results in
a Z register, or doing an operation to a ZA tile using the contents of a
Z register as an operand).

It is also expected that some applications may prefer to execute what is
mainly an SVE workload in streaming mode.  As well as any performance
relevant differences in the implementation choices the hardware makes,
it is likely that some systems will have vector lengths available in
streaming mode that are otherwise unavailable (eg, you might have PEs
with 128 bit FPSIMD/SVE units and a 512 bit SMCU).

I don't have a good handle on which sort of usage is going to be more
common, and I expect that the answer is going to be very system
dependent, varying based on both the mix of applications running on the
system at any given moment and the capabilities of the standard and
streaming mode floating point implementations that the system has.

However, the existing syscall ABI for the Z and P registers (which is
all the SVE register state, FFR is a magic P register) means that unless
we treat streaming mode differently to non-streaming mode we'll be
discarding whatever state is there anyway, so userspace by definition
shouldn't have anything in there it expects to be preserved when it does
a syscall.  I'd rather not introduce an ABI that guarantees that we
preserve the streaming mode SVE register state in cases where we discard
(or can discard) the non-streaming SVE register state: that's both going
to be more complicated to implement and more likely to cause unexpected
differences that trip userspace up.
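
As a rough illustration of the argument above, here is a toy C model
(not kernel code) that applies one discard rule on syscall entry whether
or not streaming mode is on.  Every name in it is hypothetical, the
vector length is an arbitrary choice, and "discard" is modelled as
zeroing, which is only one of the outcomes the ABI allows.  Keeping the
low 128 bits of each Z register in streaming mode too is an assumption
of the sketch, not something settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model: names and sizes are illustrative only. */
#define TOY_VL_BYTES   32	/* assumed vector length: 256 bits */
#define TOY_FPSIMD_LEN 16	/* low 128 bits of each Z register */
#define TOY_NUM_Z      32
#define TOY_NUM_P      16	/* FFR omitted for brevity */

struct toy_sve_state {
	bool streaming;		/* stands in for PSTATE.SM */
	unsigned char z[TOY_NUM_Z][TOY_VL_BYTES];
	unsigned char p[TOY_NUM_P][TOY_VL_BYTES / 8];
};

/*
 * Apply the same discard rule in both modes: keep the low 128 bits of
 * each Z register, zero the rest of Z and all of P.  The streaming
 * flag is deliberately not consulted and not modified.
 */
void toy_syscall_discard(struct toy_sve_state *s)
{
	assert(s != NULL);

	for (int i = 0; i < TOY_NUM_Z; i++)
		memset(&s->z[i][TOY_FPSIMD_LEN], 0,
		       TOY_VL_BYTES - TOY_FPSIMD_LEN);
	memset(s->p, 0, sizeof(s->p));
}
```

Because the function never looks at the streaming flag, userspace sees
the same result in both modes, which is the property being argued for.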

> about (can we even trap SVE instructions independently of SME while in
> streaming mode?).

I'd need to check through but I don't believe so.

> I'd find it clearer if we preserved PSTATE.SM and, w.r.t. the streaming
> SVE state, we somewhat follow the PCS and not restore the regs (input
> from the libc people welcomed).

Like I say, we can do that easily enough; it's not something I expect
to ever come up in practical usage though.

> > Having said that as with ZA userspace can just exit streaming mode to
> > avoid any overhead having it enabled introduces and the common case is
> > expected to be that it will have done so due to the PCS, it should be an
> > extremely rare case - unlike keeping ZA active there doesn't seem to be
> > any case where it would be sensible to want to do this and the PCS means
> > you'd have to actively try to do so.

> IIUC, the PCS introduced the notion of streaming-compatible functions
> that preserve the SM bit. If they are non-streaming, SM should be 0 on

Yes, it isn't the default though.

> entry. It would be nice if we put the syscalls in one of these
> categories, so either mandate SM == 0 on entry or preserve (the latter
> being easier, I think, I haven't looked at what it takes to save/restore
> the streaming SVE state; I may change my mind after reviewing the
> other patches).

The streaming SVE state is identical to the SVE state with the exception
of the FFR predicate register, which is not present unless FA64 is
available in the system and enabled, and the separately configured
vector length.
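
To make the two differences concrete, the following sketch computes the
size of the Z/P/FFR register state for a given vector length, with FFR
optional.  The function name is hypothetical and the result is just the
raw architectural register size (32 Z registers of VL bytes, 16 P
registers of VL/8 bytes, and FFR the size of a P register), not any
kernel or signal frame layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: raw size in bytes of the SVE-style register
 * state for a vector length of vl_bytes.  has_ffr would be false for
 * streaming mode without FA64 and true otherwise.
 */
size_t toy_sve_state_size(unsigned int vl_bytes, bool has_ffr)
{
	/* Vector lengths are multiples of 128 bits (16 bytes). */
	assert(vl_bytes != 0 && vl_bytes % 16 == 0);

	size_t z = 32 * (size_t)vl_bytes;	/* Z0-Z31 */
	size_t p = 16 * (size_t)(vl_bytes / 8);	/* P0-P15 */
	size_t ffr = has_ffr ? vl_bytes / 8 : 0;

	return z + p + ffr;
}
```

For example, toy_sve_state_size(16, true) describes a 128 bit
non-streaming vector length with FFR, and toy_sve_state_size(64, false)
a 512 bit streaming vector length without FA64: the same formula with a
different length and no FFR term.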

It's sounding like we may as well just preserve SM, it shouldn't come up
that often anyway and if it causes performance problems we can probably
optimise it, and/or userspace can simply just not do that.  Like I say I
don't have particularly strong feelings, the current behaviour was just
the easiest thing to implement and it doesn't seem like there is a use
case.  This is fine by me, I can do that for the next version.

[fork()/clone() behaviour]
> (few hours later) I think instead of singling out fork() (clone3()
> actually), we can just say that new tasks (process/thread) always start
> with PSTATE.ZA == 0, PSTATE.SM == 0 (tbd for this) and TPIDR2_EL0 == 0
> irrespective of any clone3() flags (even CLONE_SETTLS). The C library
> will have to implement the lazy ZA saving in the parent before the
> syscall and the child will automatically recover the state if it follows
> the PCS.

Works for me, I think forcing userspace to consider this is going to
work out to be more robust.
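
The proposed rule for new tasks can be sketched as a toy C model (not
kernel code; the struct, the function and the flag constant are all
hypothetical stand-ins).  Note that PSTATE.SM == 0 is marked as still to
be decided in the quoted proposal, so treating it as fixed here is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CLONE_SETTLS 0x1u	/* hypothetical stand-in for CLONE_SETTLS */

struct toy_sme_task {
	bool za;		/* PSTATE.ZA */
	bool sm;		/* PSTATE.SM */
	uint64_t tpidr2;	/* TPIDR2_EL0 */
};

/*
 * A new task always starts with ZA and SM off and TPIDR2_EL0 zero,
 * whatever the parent had and whatever flags were passed.
 */
struct toy_sme_task toy_new_task(const struct toy_sme_task *parent,
				 unsigned int flags)
{
	struct toy_sme_task child = { .za = false, .sm = false, .tpidr2 = 0 };

	assert(parent != NULL);
	(void)flags;	/* deliberately ignored, including TOY_CLONE_SETTLS */

	return child;
}
```

Since nothing is inherited, any ZA contents the child needs have to be
saved by userspace in the parent before the syscall, as described in the
quoted text.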