[RFC PATCH] KVM: arm/arm64: Don't let userspace update CNTVOFF once guest is running

Thu Jul 9 05:08:54 PDT 2015

Hi Jan,

On Thu, Jul 09, 2015 at 12:40:39PM +0200, Jan Kiszka wrote:
> On 2015-07-09 12:22, Christoffer Dall wrote:
> > Hi Peter and Marc,
> > 
> > [cc'ing Paolo for his input on x86 timekeeping]
> > 
> > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote:
> >> On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier at arm.com> wrote:
> >>> On 08/07/15 17:06, Peter Maydell wrote:
> >>>> I'd prefer it if somebody could investigate to see why QEMU
> >>>> is actually doing this -- so far we just have speculation.
> >>>
> >>> I'd prefer that too, but so far people seem to be more comfortable
> >>> waiting for the issue to fix itself. In the meantime, VMs are broken in
> >>> weird and wonderful ways, and I don't think the current status-quo helps
> >>> anyone.
> >>
> >> Putting in a patch which might not be the right fix isn't
> >> necessarily a good plan either...
> >>
> >> Does has_run_once get cleared if we do a re-VCPU_INIT
> >> of a CPU that's run before? (We need to allow rewriting
> >> of guest state at that point so that "reset VM and
> >> load migration state" behaves correctly.)
> > 
> > no, it does not, has_run_once is set the first time a VCPU is run and is
> > currently *never* cleared.
> > 
> >>
> >> I suspect Jan is right and we really need to distinguish
> >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably
> >> implies some kind of whitelist/override mechanism, since
> >> by and large we neither know nor want to know the
> >> semantics for system registers, we leave that up to the
> >> kernel.
> >>
> >> Q: if you have a running VM, and you pause it for
> >> an hour, what should the CNTVCT register do? Presumably
> >> it should not advance, but how do we arrange for that
> >> to happen?
> >>
> > 
> > I think the CNTVCT should not advance when the VM is not scheduled, so
> > if we pause the VM or starve all the VCPUs for enough time, the guest
> > should not see time progressing, since otherwise the guest scheduler
> > cannot maintain fairness and you're bound to see spurious RCU stalls
> > etc.
> > 
> > That's exactly why a guest can read both a virtual and physical counter
> > and it is an area where you simply want some level of
> > paravirtualization.  I haven't studied how/if Linux deals with this at
> > all.
> > 
> > So I think adjusting CNTVOFF should be managed by the kernel for the
> > pause/starvation scenario (which I think Avi once told me x86 does too -
> > does anyone know the current state of the art?).
> > 
> > So the only situation where I think userspace should adjust the CNTVOFF
> > value is for migration where we are talking about a brand new VM with
> > has_run_once clear.
> > 
> > Thus, if we were designing this from scratch now, the API should
> > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the
> > VM has run once, but it's too late for that as we would break userspace.
> > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF
> > in the kernel side as well, and finally also fix QEMU so that it doesn't
> > try to do the thing that future kernels will ignore.
> 
> Fixing QEMU to only write on KVM_PUT_FULL_STATE - yes, that should be
> done, but I don't think the approach for the kernel is generally right.
> The kernel should not do any policing on user space requests to change
> the VCPU or VM state unless
> 
>  - security is affected
>  - userspace lacks information, thus cannot decide correctly
>  - legacy userspace has a bug, we can detect it and want to fix that up
>    without affecting future userspace that has a reason to do it
>    differently
> 
> Regarding CNTVOFF, the first two criteria do not apply for sure. Maybe
> the last one, don't know. Just think of the hypothetical scenario that a
> userspace VM debugger wants to inject certain register manipulations. If
> you block this by some hidden VM state like proposed, that feature would
> no longer be implementable easily.
> 
Hmm, I didn't think about this, I guess reverse execution and
record/replay scenarios would also want you to make time appear to be
going backwards?

But the third case kind of does apply, if we care about fixing legacy
userspace in the kernel.  We could define a new register index for the
ioctl that allows time to move forward that future userspace can use,
and impose the restriction of not making time go backwards with the
current index.  Would that break any currently working use cases on
arm64 with legacy userspace, such as migration?

Thanks,
-Christoffer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: Digital signature
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20150709/660d9eea/attachment.sig>