[RFC v2 REPOST] arm64: KVM: KVM API extensions for SVE

Wed Feb 7 06:58:31 PST 2018

On Wed, Feb 07, 2018 at 11:33:28AM +0000, Dave Martin wrote:
> On Tue, Feb 06, 2018 at 02:17:57PM +0100, Christoffer Dall wrote:
> > On Tue, Feb 06, 2018 at 11:43:16AM +0000, Dave Martin wrote:
> > > On Mon, Feb 05, 2018 at 05:13:08PM +0100, Christoffer Dall wrote:
> > > > Hi Dave,
> > > > 
> > > > On Fri, Jan 26, 2018 at 05:28:49PM +0000, Dave Martin wrote:
> 
> [...]
> 
> > > Exposing variable-size regs directly through the ioctl interface,
> > > or growing the ioctl view of the regs much beyond 2048 bits seems
> > > impractical, so the sliced view seemed reasonable.  In practice, we
> > > might never need more than one slice anyway.
> > > 
> > > > >    The slice sizes allow each register to be read/written in exactly
> > > > >    one slice for SVE.
> > > > > 
> > > > >    Surplus bits (beyond the maximum VL supported by the vcpu) will
> > > > >    be read-as-zero write-ignore.
> > > > 
> > > > This implies that there are some ordering fun to be had between
> > > > accessing SVE registers and changing the maximum vector length for the
> > > > vcpu.  Why not loosen the API slightly and say that anything written to
> > > > the surplus bits is not guaranteed to be preserved when read back?  (In
> > > > other words, we make no guarantees).
> > > 
> > > I'm happy enough to say that, since it may remove some burden from the
> > > implementation; though only memzeroing so perhaps not too exciting.
> 
> [...]
> 
> > > > >  * KVM_REG_ARM_SVE_Z(n, 0) bits [127:0] alias Vn in
> > > > >    KVM_REG_ARM_CORE(fp_regs.v[n]) .. KVM_REG_ARM_CORE(fp_regs.v[n])+3.
> > > > 
> > > > nit: KVM_REG_ARM_CORE_REG doesn't specify the size, so I think this
> > > > would be more clearly stated as
> > > > "aliases KVM_REG_SIZE_U128 | KVM_REG_ARM_CORE(fp_regs.vregs[n])" or as
> > > > "aliases kvm_regs.regs.fp_regs.vregs[n]".
> > > 
> > > I thought the size was effectively hard-coded to be 64 bits for
> > > KVM_REG_ARM_CORE().  Or did I misunderstand?
> > > 
> > 
> > It's not.  KVM_REG_ARM_CORE only encodes the 'group' of registers.
> > Userspace then has to OR on the size.  So the 128 bits FP regs are
> > accessed with the following logic in QEMU:
> > 
> >   #define AARCH64_SIMD_CORE_REG(x)				\
> > 		(KVM_REG_ARM64 | KVM_REG_SIZE_U128 |		\
> >                  KVM_REG_ARM_CORE | KVM_REG_ARM_CORE_REG(x))
> > 
> >   reg.id = AARCH64_SIMD_CORE_REG(fp_regs.vregs[i]);
> > 
> > Looking back, the macros in kvm.h are potentially a bit silly, and
> > should rather have a full AARCH64_SIMD_CORE_REG() definition (maybe we
> > should add that), but the idea is to use the offset into the kvm
> > structure as an indication for *which* register to retrieve, and then
> > encode the size of the register at the same time, and provide a pointer
> > to a value of the same size, which should avoid any endianness problems
> > etc.
> 
> After another look at arch/arm64/kvm/guest.c:get_core_reg(), I get
> this now.
> 
> > > > >    It's simplest for userspace if the two views always appear to be
> > > > >    in sync, but it's unclear whether this is really useful.  Perhaps
> > > > >    this can be relaxed if it's a big deal for the KVM implementation;
> > > > >    I don't know yet.
> > > > 
> > > > It's not going to be a big deal, and it's the only sensible thing to do;
> > > > otherwise we'll have no clear semantics of where the values are.  If KVM
> > > > chooses to duplicate the 128 bits of state in memory, then it must also
> > > > know how to syncrhonize that duplicated state when running the guest,
> > > > and therefore can also do this in software on SET/GET.
> > > 
> > > I tried to remember my full rationale and describe what the rules
> > > would be if the two views are allowed to be incoherent ... and
> > > concluded that you are probably correct.  This also matches the
> > > architectural view of the registers, which is what users would
> > > naturally expect.
> > 
> > Some early experiences taught us that anything we expose to user space
> > should really be as close to the architectural concepts as possible,
> > otherwise we quickly venture into land of ambiguity, which is a terrible
> > place to be for a kernel<->userspace ABI.
> 
> Agreed!
> 
> > > Software intentionally modifying the registers would need to know
> > > about this aliasing, but since software is explicitly required to
> > > enable the KVM_ARM_VCPU_SVE feature, this shouldn't pose problems
> > > for backwards compatibility.
> > > 
> > 
> > Agreed.
> > 
> > > > > Vector length control:
> 
> [...]
> 
> > > > >    ioctl(vcpu_fd, KVM_ARM_SVE_SET_VLS, __u64 vqs[SVE_VQ_MAX / 64]);
> > > > 
> > > > I don't understand the proposed semantics here.  What is the vqs array
> > > > and how should the values be set?
> > > 
> > > vqs is a bitmap, where bit (n - 1) indicates support for VL = 128n bits.
> > > 
> > > Since the hardware is not required to support all possible vector
> > > lengths up to the maximum, a bitmap seemed the most straightforward way
> > > of representing the possibly sparse set of supported VLs.
> > > 
> > > Calling it vq_bitmap[] would make this clearer.
> > 
> > Either way, I just think the details need to be spelled out.
> 
> Sure.  This won't be the final documentation, but I'll bear it in
> mind.  The vq bitmap concept is the same as used to track the supported
> vector lengths in fpsimd.c, but sane people are probably not aware of
> that.
> 
> > > Describing only the maximum supported vl turns out to be insufficient.
> > > 
> > > [...]
> > > 
> > > > >    At present, other than initialising each vcpu to the maximum
> > > > >    supportable set of VLs, I don't propose having a way to probe for
> > > > >    what sets of VLs are supportable: the above call either succeeds or
> > > > >    fails.
> > > > > 
> > > > 
> > > > I think we should have such a way.  Libvirt can do things like figure
> > > > out compatible features across nodes, and trying to set all possible
> > > > vector lengths doesn't seem like a great approach.
> > > > 
> > > > How about creating the interface based on a bitmap of all the possible
> > > > lengths, and setting the bits for all lengths, and reading back the
> > > > value returns the actual supported lengths?
> > > 
> > > Would it be sufficient for an initial call to KVM_ARM_VCPU_GET_SVE_VLS
> > > to return the default (i.e., maximum supported) set?
> > > 
> > > This would mean that once KVM_ARM_VCPU_SET_SVE_VLS has been called to
> > > set a different set, it would no longer be possible to find out the
> > > maximum set.  I'm not sure this would matter, though.
> > > 
> > 
> > I don't particularly like this, because the init sequence in QEMU can be
> > pretty complicated, and I think it can become pretty nasty to have to
> > remember if we 'once upon a time' called some ioctl, because then
> > calling it now will mean something else.
> > 
> > In case my suggestion wasn't clear, this is what I meant:
> > 
> >     #define NR_VQS ((32 * 2048 / 128) / 64)
> > 
> >     __u64 vqs[NR_VQS] = { [0 ... NR_VQS - 1] = ~0 };
> > 
> >     ioctl(vcpu_fd, KVM_ARM_SVE_SET_VLS, vqs);
> >     ioctl(vcpu_fd, KVM_ARM_SVE_GET_VLS, vqs);
> > 
> >     for (i = 0; i < 16; i++)
> >         if (vqs[0] & BIT(i))
> >             printf("vector length %d supported by KVM\n", (i+1) * 128);
> 
> This does mean that if the caller doesn't get the set they asked for
> in KVM_ARM_SVE_SET_VLS this is not treated as an error by the kernel.  I
> was concerned this would encourage userspace to silently miss this
> error.

Good point.

> 
> What if KVM_ARM_SVE_SET_VLS() were to yield 0 if the exact requested set
> of VLs was configured, -ERANGE if some subset was configured successfully
> (but not exactly the set requested), and the usual -EINVAL/-EIO/whatever
> if the set of VLs couldn't be configured at all?

Sounds good to me.

> 
> Then the probe would go like this:
> 
> 	__u64 vqs[SVE_VQ_MAX / 64] = { [0 ... SVE_VQ_MAX / 64 - 1] = ~(u64)0 };
> 	if (ioctl(vcpu_fd, KVM_ARM_SVE_SET_VLS, vqs) && errno != ERANGE))
> 		goto error;
> 
> 	ioctl(vcpu_fd, KVM_ARM_SVE_GET_VLS, vqs);
> 
> 	/* ... */
> 
> Another option would be for SVE_ARM_SVE_SET_VLS to write the resulting
> set back to its argument, possibly with the same 0/ERANGE/EINVAL semantics.
> 
> 
> Alternatively, userspace would be require to do a KVM_ARM_SVE_GET_VLS,
> and check the resulting set:
> 
> 	/* ... */
> 	
> 	__u64 newvqs[SVE_VQ_MAX / 64];
> 	ioctl(vcpu_fd, KVM_ARM_SVE_GET_VLS, newvqs);
> 
> 	if (memcmp(vqs, newvqs, sizeof vqs))
> 		goto mismatch;
> 
> vcpu restore would need to treat any mismatch as an error:
> the exact requested set but be configurable, or the VM may go wrong.

I'm not sure I can parse this sentence or extract the meaning?

> 
> 
> Any opinion on which approach is best?
> 

I think I prefer letting KVM_ARM_SVE_SET_VLS change the supplied vqs,
since having it be two separate ioctls always potentially leaves room
for some other thread having modified the set in the meantime (or making
a programmer doubt if this can be the case) where a single ioctl() will
look atomic.

The user can always call KVM_ARM_SVE_GET_VLS afterwards and should get
the same result.

I think it's worth trying to write this up as patches to the KVM
documentation and to the kernel and see what people say on that.

> I was fighting with this in December, trying to come up with a way
> for userspace to specify which VLs it requires/permits/forbids and
> letting the kernel come up with something matching these constraints.
> But this seemed too expressive and complex, so I had dropped the idea
> in favour of something simpler.
> 
Thanks,
-Christoffer