KVM/arm64: Guest ABI changes do not appear rollback-safe

Mon Jan 24 19:47:04 PST 2022

Hello all,

Based on the recent discussion on the patch '[RFC PATCH v3 01/11] KVM:
Capture VM start' [1], Reiji, I (and others in the team) were
wondering, since the hypercall feature-map is technically per-VM and
not per-vCPU, would it make more sense to present it as a kvm_device,
rather than vCPU psuedo-registers to the userspace?

If I understand correctly, the original motivation for going with
pseudo-registers was to comply with QEMU, which uses KVM_GET_REG_LIST
and KVM_[GET|SET]_ONE_REG interface, but I'm guessing the VMMs doing
save/restore across migration might write the same values for every
vCPU.

Granted that we would be diverging from the existing implementation
(psci versioning and spectre WA registers), but this can be a cleaner
way forward for extending hypercall support. The kvm_device interface
can be made backward compatible with the way hypercalls are exposed
today, it can have the same registers/features discovery mechanisms
that we discussed above, and majorly the userspace has to configure it
only once per-VM. We can probably make the feature selection immutable
just before any vCPU is created.

Please let me know your thoughts or any disadvantages that I'm overlooking.

Thanks,
Raghavendra

[1]: https://lore.kernel.org/kvmarm/CAJHc60zhRyOad7AqtEFn-Ptro5BGVkfpB2wXWGw5EZMxOHUc=w@mail.gmail.com/

On Fri, Oct 1, 2021 at 8:38 AM Oliver Upton <oupton at google.com> wrote:
>
> On Fri, Oct 1, 2021 at 4:43 AM Marc Zyngier <maz at kernel.org> wrote:
> >
> > On Thu, 30 Sep 2021 18:24:23 +0100,
> > Oliver Upton <oupton at google.com> wrote:
> > >
> > > Hey Marc,
> > >
> > > On Thu, Sep 30, 2021 at 12:32 AM Marc Zyngier <maz at kernel.org> wrote:
> > > >
> > > > Hi Oliver,
> > > >
> > > > On Wed, 29 Sep 2021 19:22:05 +0100,
> > > > Oliver Upton <oupton at google.com> wrote:
> > > > >
> > > > > I have some lingering thoughts on this subject since we last spoke and
> > > > > wanted to discuss.
> > > > >
> > > > > I'm having a hard time figuring out how a VMM should handle a new
> > > > > hypercall identity register introduced on a newer kernel. In order to
> > > > > maintain guest ABI, the VMM would need to know about that register and
> > > > > zero it when restoring an older guest.
> > > >
> > > > Just as it would need to be able to discover any new system register
> > > > exposed by default, as it happens at all times. Which is why we have a
> > > > way to discover all the registers, architected or not.
> > > >
> > > > > Perhaps instead we could reserve a range of firmware registers as the
> > > > > 'hypercall identity' registers. Implement all of them as RAZ/WI by
> > > > > default, encouraging userspace to zero these registers away for older
> > > > > VMs but still allowing an old userspace to pick up new KVM features.
> > > > > Doing so would align the hypercall identity registers with the feature
> > > > > ID registers from the architecture.
> > > >
> > > > The range already exists in the form of the "coprocessor" 0x14. I
> > > > don't see the need to expose it as RAZ/WI, however. If userspace
> > > > doesn't know about how this pseudo-register works, it won't be able to
> > > > program it anyway.
> > > >
> > > > I don't buy the parallel with the ID-regs either. They are RAZ/WI by
> > > > default so that they don't UNDEF at runtime. The meaning of a RAZ
> > > > id-register is also well defined (feature not implemented), and the
> > > > CPU cannot write to them. In a way, the ID-regs *are* the enumeration
> > > > mechanism.
> > > >
> > > > Our firmware registers don't follow the same rules. Userspace can
> > > > write to them, and there is no such "not implemented" rule (case in
> > > > point, PSCI). We also have a separate enumeration mechanism
> > > > (GET_ONE_REG), which is (more or less) designed for userspace to find
> > > > what is implemented.
> > > >
> > > > For these reasons, I don't immediately see the point of advertising a
> > > > set of registers ahead of time, before userspace grows an
> > > > understanding of what these registers mean.
> > >
> > > Supposing we don't preallocate some hypercall ID registers, how can we
> > > safely migrate a guest from an older kernel to newer kernel? Ideally,
> > > we would preserve the hypercall feature set across the migration which
> > > could work for a while with the first set of registers that get
> > > defined, but whenever a new hypercall firmware register comes along
> > > then the VMM will be clueless to the new ABI.
> >
> > My expectations were that whenever userspace writes a set of firmware
> > register, all the defaults are invalidated. For example say that
> > host-A know about a single hypercall register, while host-B knows
> > about two. Upon writing to the first register, the host would clear
> > the set of available services in the second one. If userspace
> > eventually writes there, the value would stick if valid.
> >
> > Also, remember that pseudo-registers don't have to be 64bit. We could
> > define a new class of hypercall-specific registers that would be much
> > wider, and thus have a larger write granularity (KVM supports anything
> > from 8 to 2048 bits). This would make it pretty easy to implement the
> > above.
> >
> > > Fundamentally, I don't think userspace should need a patch to preserve
> > > ABI on a newer kernel. Despite that, it would seem that userspace will
> > > need to learn of any firmware registers that control hypercall
> > > features which come after the initial set that gets proposed. If
> > > KVM_GET_REG_LIST were to disambiguate between ID registers (hypercall,
> > > architectural feature ID registers) from other parts of the vCPU
> > > state, it would be clear to what registers to zero on a newer kernel.
> > > Apologies if it is distracting to mention the feature ID registers
> > > here, but both are on my mind currently and want to make sure there is
> > > some consistency in how features get handled on newer kernels,
> > > architected or not.
> >
> > The problem I see is that we will always need to grow the HC register
> > space one way or another, no matter how many we reserve. Your approach
> > only works if we don't exceed that particular range. Maybe it will
> > never be a problem, but it remains that this is not scalable.
> >
> > If we wanted to be safe, we'd reserve the whole of the possible space
> > as defined by the SMCCC spec. Given that we currently have two HC
> > spaces (the ARM-controlled one, and the KVM-specific one), the
> > function space being 16bits in each case, that's 16kB worth of zeroes
> > that userspace has to save/restore at all times... I'm not overly
> > enthusiastic.
>
> Yeah, that's definitely overkill.
>
> I agree with your position; we can solve the issue altogether by
> enforcing ordering on the hypercall registers. Userspace should read
> all hypercall registers to discover features, then write them. Given
> your suggestion, I don't know if there is much need for the guesswork
> to conclude on an appropriate register size for the hypercall bitmaps.
> 64 bits seems fine to me..
>
> --
> Thanks,
> Oliver