[RFC] KVM API extensions for SVE

Dave Martin Dave.Martin at arm.com
Thu Nov 23 10:40:50 PST 2017


On Wed, Nov 22, 2017 at 08:52:30PM +0100, Christoffer Dall wrote:
> Hi Dave,
> 
> On Tue, Nov 21, 2017 at 01:49:16PM +0000, Dave Martin wrote:
> > Hi all,
> > 
> > SVE adds some new registers, and their size depends on the hardware and
> > on runtime sysreg settings.
> > 
> > Before coding something, I'd like to get people's views on my current
> > approach here.
> > 
> > --8<--
> > 
> > New vcpu feature flag:
> > /*
> >  * userspace can support regs up to at least 2048 bits in size via ioctl,
> >  * and will honour the size field in the reg ID
> >  */
> > #define KVM_ARM_VCPU_LARGE_REGS	4
> > 
> > Should we just error out if userspace fails to set this on a system that
> > supports SVE, or is that too brutal?
> 
> That would break older userspace on newer kernels on certain hardware,
> which would certainly not be acceptable.  Or did I misunderstand?

Yes, which is probably bad.

I'm still trying to gauge the policy regarding backwards compatibility.


So, I guess you're saying that we should disable SVE for the guest
but still run it in this case.


This creates another issue: if SVE is supported by the host kernel
but not enabled for the guest, do I need to hide the SVE regs from
KVM_GET_REG_LIST?

If not, a guest without SVE that was created on a host supporting SVE
would not be migratable to a host kernel that doesn't support SVE but
could otherwise run the guest: as I understand it, the attempt to
restore the SVE regs on the target node would fail because the target
kernel doesn't recognise those regs.
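
To illustrate, here is a minimal sketch of the usual save/restore
pattern, using the existing KVM_GET_REG_LIST / KVM_SET_ONE_REG
interface (error handling and the source-side enumeration loop are
omitted):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Typical restore path on the migration target: every reg ID that the
 * source's KVM_GET_REG_LIST reported gets fed back via KVM_SET_ONE_REG.
 * If the source listed SVE reg IDs but the target kernel has never
 * heard of them, this ioctl fails and the restore is aborted.
 */
static int restore_one_reg(int vcpu_fd, __u64 id, const void *data)
{
	struct kvm_one_reg reg = {
		.id   = id,
		.addr = (__u64)(unsigned long)data,
	};

	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}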

Or do we not claim to support backwards compatibility for migration?

> 
> > If we do treat that as an error,
> > then we can unconditionally enable SVE for the guest when the host
> > supports it -- which cuts down on unnecessary implementation complexity.
> 
> I think it makes sense to be able to disable SVE, especially if there's
> high overhead related to context switching the state.  Since you say the
> implementation complexity is unnecessary, I may be missing some larger
> point?

I don't think it's all that bad, but there doesn't seem to be any
example of an optional architecture feature for a guest today --
so I wanted to check before setting a precedent here.

Would "SVE enabled" be better as an attribute, rather than a
feature, or does it not really matter?

> > 
> > Alternatively, we would need logic to disable SVE if userspace is too
> > old, i.e., doesn't set this flag.  Then we might need to enforce that
> > the flag is set the same on every vcpu -- but from the kernel's PoV
> > it probably doesn't matter.
> 
> Not sure I understand why it doesn't matter from the kernel's PoV.
> 
> I think SVE should be disabled by default (as it is now) and then we
> should expose a capability (potentially simply via the vcpu attributes
> being present), and let userspace enable SVE and set a vector length.

Yes, but aren't all the attributes/features per-vcpu?

> It makes sense that userspace needs to know about SVE for VMs to use it,
> doesn't it?

Yes and no.  Except for debug purposes I don't see why userspace needs
to know anything except how to handle large registers through the ioctl
interface.
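
For example, all userspace would really need to do is advertise the
proposed feature bit at vcpu init time, something like the sketch
below (KVM_ARM_PREFERRED_TARGET and KVM_ARM_VCPU_INIT are existing
API; the feature number is just the value proposed above):

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define KVM_ARM_VCPU_LARGE_REGS	4	/* proposed above; not in any header */

static int vcpu_init_with_large_regs(int vm_fd, int vcpu_fd)
{
	struct kvm_vcpu_init init;

	/* Ask the kernel which target suits this host; features[] is zeroed */
	if (ioctl(vm_fd, KVM_ARM_PREFERRED_TARGET, &init))
		return -1;

	/* Declare that we can handle regs of at least 2048 bits via ioctl */
	init.features[0] |= 1U << KVM_ARM_VCPU_LARGE_REGS;

	return ioctl(vcpu_fd, KVM_ARM_VCPU_INIT, &init);
}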

> I assume SVE systems will have SVE on all CPUs in an SMP system, or am I
> being way too optimistic about the ARM ecosystem here?  Just like we
> don't model big.LITTLE, I think we should enforce in the kernel, that

The kernel follows the same policy: if SVE is not present on all
physical CPUs we disable it completely and hide it from guests and
userspace.

For the vector length I'm a little more permissive: the maximum vector
length offered would be clamped to the smallest maximum vector length
supported across the physical CPUs.
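
Roughly like this sketch (illustrative only, not the actual code:
cpu_max_vl() and nr_host_cpus stand in for however the host probes
per-CPU SVE support; lengths are in bytes):

static unsigned int kvm_sve_max_vl(void)
{
	unsigned int vl = 2048 / 8;	/* architectural maximum, in bytes */
	unsigned int cpu;

	/* Offer guests only what every physical CPU can hold */
	for (cpu = 0; cpu < nr_host_cpus; cpu++) {
		unsigned int cpu_vl = cpu_max_vl(cpu);	/* hypothetical helper */

		if (cpu_vl < vl)
			vl = cpu_vl;
	}

	return vl;
}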

> userspace configures all VCPUs with the same SVE properties.

OK, so long as you think it's not superfluous to do it, then I'm happy
to do it.

> > 
> > /*
> >  * For the SVE regs, we add some new reg IDs.
> >  * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
> >  * slices.  This is sufficient for only a single slice to be required
> >  * per register for SVE, but ensures expansion room in case future arch
> >  * versions grow the maximum size.
> >  */
> 
> I don't understand the last part of this comment?

This may be explained better by my response below.

> > #define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)
> 
> Shift left by KVM_REG_SIZE_MASK?  I'm confused.
> 
> > #define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
> > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
> > 		((n) << 5) | (i))
> > #define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
> > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
> > 		(((n) + 32) << 5) | (i))
> > #define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
> > 	 KVM_REG_ARM64_SVE_P(16, i)
> > 
> > For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32 * (j + 1) - 1 : 32 * j]
> > 	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
> > 
> 
> This is hard to read and understand the way presented here.  I would
> suggest you formulate this suggestion in the form of an RFC patch to
> Documentation/virtual/kvm/api.txt plus the header definitions.

Sure, I hadn't figured out the best way to present this: I was thinking
aloud.

> (I'm not sure where to look to decode the "<< 5" and the
> " + 32) << 5)" stuff above.)

The idea here is that we have 49 registers: Z0-Z31, P0-P15 and FFR.
They are numbered serially.

However, the SVE architecture leaves the possibility of future
expansion open, up to 32 times the current maximum.

The KVM reg API doesn't support such ridiculously huge registers,
so my proposal is to slice them up, indexed by the value in the
bottom 5 bits of the reg ID.  This requires the "register ID"
field to be shifted up by 5 bits.

If the regs are not oversized (and for the current architecture they
never are), then we simply don't list the extra slices via
KVM_GET_REG_LIST.
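
To make the layout concrete, the proposed IDs would decompose
something like this (the helper and macro names are invented for
illustration; only the bit positions come from the definitions quoted
above):

#include <stdint.h>

#define SVE_REG_SLICE_MASK	0x1f	/* bits [4:0]: slice index */
#define SVE_REG_NUM_SHIFT	5
#define SVE_REG_NUM_MASK	0x3f	/* 49 registers need 6 bits */

/* Slice index within the (possibly oversized) register */
static inline unsigned int sve_reg_slice(uint64_t id)
{
	return id & SVE_REG_SLICE_MASK;
}

/* Serial register number: Z0-Z31 are 0-31, P0-P15 are 32-47, FFR is 48 */
static inline unsigned int sve_reg_num(uint64_t id)
{
	return (id >> SVE_REG_NUM_SHIFT) & SVE_REG_NUM_MASK;
}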

> > Bits above the max vector length could be
> 
> Which bits are these and where are they, and why do we have them?

The KVM register size via ioctl is fixed at 2048 bits here.  Since
the system might not support vectors that large, bits 2047:1024 of the
ioctl payload (say, on hardware whose maximum vector length is 1024
bits) wouldn't map to any register bits in the hardware.  Should KVM
still store them somewhere?  Should they logically exist for the
purpose of the ioctl() API?

Making the size dynamic to avoid surplus bits doesn't work, because KVM
only supports power-of-two reg sizes, whereas SVE can support non-power-
of-two sizes.
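
Concretely, the "zero on read; ignored on write" option from my list
(quoted again below) might look something like this sketch (not real
KVM code; vl is the hardware vector length in bytes and the names are
invented):

#include <string.h>

#define SVE_IOCTL_ZREG_BYTES	(2048 / 8)	/* fixed ioctl view of a Z reg */

/* GET: copy out the bytes the hardware actually backs, zero the surplus */
static void sve_zreg_get(void *dst, const void *vcpu_zreg, unsigned int vl)
{
	memcpy(dst, vcpu_zreg, vl);
	memset((char *)dst + vl, 0, SVE_IOCTL_ZREG_BYTES - vl);
}

/* SET: take only the backed bytes; surplus bits in the payload are ignored */
static void sve_zreg_set(void *vcpu_zreg, const void *src, unsigned int vl)
{
	memcpy(vcpu_zreg, src, vl);
}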

> Is the max vector length the max architecturally (current version)
> defined length, or what is chosen for this VM?

For now, that's an open question.

> >  * don't care (or not copied at all) on read; ignored on write
> >  * zero on read; ignored on write
> >  * zero on read; must be zero on write
> > 
> > Bits between the current and max vector length are trickier to specify:
> > the "current" vector length for ioctl access is ill-defined, because
> > we would need to specify ordering dependencies between Zn/Pn/FFR access
> > and access to ZCR_EL1.
> 
> I think you want userspace to be able to read/write these values in any
> order compared to configuring SVE for the VM, and then fix up whatever
> needs masking etc. in the kernel later, if possible.  Ordering
> requirements to userspace accesses have shown to be hard to enforce and
> get right in the past.

So I've heard from other people.

> What happens on hardware if you give a certain vector length to EL0, EL0
> writes a value of the full length, and then EL1 restricts the length to
> something smaller, and subsequently expands it again?  Is the original
> full value visible or are some bits potentially lost?  IOW, can't we
> rely on what the architecture says here?

The architecture says that bits that are hidden and then revealed again
are either preserved whilst hidden, or zeroed.

Opinion differs on whether that's a good thing to expose in ABI: Will
considered it unacceptable to expose this kind of behaviour around
syscalls from userspace for example, so I currently always zero the
bits in that case even though it's slightly more expensive.

The concern here was that userspace might unintentionally rely on
the affected register bits being preserved around syscalls when this
is not guaranteed by the implementation.

This does not mean that similar design considerations apply to the KVM
ioctl interface though.

> > 
> > So, it may be simpler to expose the full maximum supported vector size
> > unconditionally through ioctl, and pack/unpack as necessary.
> 
> yes, I think this was what I tried to say.
> 
> > 
> > Currently, data is packed in the vcpu struct in a vector length
> > dependent format, since this seems optimal for low-level save/restore,
> > so there will be potential data loss / zero padding when converting.
> > 
> > This may cause some unexpected effects.  For example:
> > 
> > KVM_SET_ONE_REG(ZCR_EL1, 0) 
> > /* Guest's current vector length will be 128 bits when started */
> > 
> > KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
> > KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */
> > 
> > KVM_RUN /* reg data packed down to 128-bit in vcpu struct */
> > 
> > KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */
> > 
> 
> I really don't know how to parse this or what the point here is?  Sorry.

It means that for the ioctl interface, "obvious" guarantees like "if you
read a register you get back the last value written" don't work quite as
expected.  Some bits may have disappeared, or not, depending on the
precise scenario.

> > 
> > Since the guest should be treated mostly as a black box, I'm not sure
> > how big a deal this is.  The guest _might_ have explicitly set those
> > bits to 0... who are we to say?
> 
> How big a deal what is?  I'm lost.
> 
> > 
> > Can anyone think of a scenario where effects like this would matter?
> > 
> 
> I think we need more information.

See if my comments above throw any light on this.

I may be trying to make an issue out of nothing here.

Cheers
---Dave


