[PATCH] KVM: arm64: Don't eagerly teardown the vgic on init error

Thu Oct 10 05:47:05 PDT 2024

On Thu, 10 Oct 2024 09:47:04 +0100,
Oliver Upton <oliver.upton at linux.dev> wrote:
> 
> On Thu, Oct 10, 2024 at 08:54:43AM +0100, Marc Zyngier wrote:
> > On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton <oliver.upton at linux.dev> wrote:
> > > Then if we can't register the MMIO region for the distributor
> > > everything comes crashing down and a vCPU has made it into the KVM_RUN
> > > loop w/ the VGIC-shaped rug pulled out from under it. There's definitely
> > > another functional bug here where a vCPU's attempts to poke the
> > > distributor wind up reaching userspace as MMIO exits. But we can worry
> > > about that another day.
> > 
> > I don't think that one is that bad. Userspace got us here, and it
> > now sees an MMIO exit for something that it is not prepared to handle.
> > Suck it up and die (on a black size M t-shirt, please).
> 
> LOL, I'll remember that.
> 
> The situation I have in mind is a bit harder to blame on userspace,
> though. Supposing that the whole VM was set up correctly, multiple vCPUs
> entering KVM_RUN concurrently could cause this race and have 'unexpected'
> MMIO exits go out to userspace.
> 
> 	vcpu-0				vcpu-1
> 	======				======
> 	kvm_vgic_map_resources()
> 	  dist->ready = true
> 	  mutex_unlock(config_lock)
> 	  				kvm_vgic_map_resources()
> 					  if (vgic_ready())
> 					    return 0
> 
> 					< enter guest >
> 					writel(0, GICD_CTLR)
> 
> 					< data abort >
> 					kvm_io_bus_write(...)	<= No GICD, out to userspace
> 
> 	vgic_register_dist_iodev()
> 
> A small but stupid window to race with.

Ah, gotcha. I guess getting rid of the early-out in
kvm_vgic_map_resources() would plug that one. Want to post a fix for
that?
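
For illustration only, here is a toy, single-threaded userspace model of
the interleaving in the diagram above. Everything prefixed toy_ is
hypothetical and is not kernel code; the "window" hook stands in for
vcpu-1 being scheduled between the two steps. Note that the second
variant closes the window by publishing the ready flag only after the
iodev is registered, which is a different technique from dropping the
early-out and is shown purely as an assumption about what an ordering
fix could look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the distributor state, not the kernel's struct vgic_dist. */
struct toy_dist {
	bool ready;		/* what a vgic_ready()-style check would see */
	bool iodev_registered;	/* distributor MMIO region is on the bus */
};

enum toy_exit { TOY_HANDLED_IN_KERNEL, TOY_EXIT_TO_USERSPACE };

struct toy_vcpu1_result {
	bool entered_guest;
	enum toy_exit exit;
};

/* Hook run between the two steps, modelling vcpu-1 getting scheduled there. */
typedef void (*toy_window_fn)(struct toy_dist *d, struct toy_vcpu1_result *res);

/* A guest access to the distributor: without the iodev it leaks to userspace. */
static enum toy_exit toy_guest_mmio(const struct toy_dist *d)
{
	assert(d != NULL);
	return d->iodev_registered ? TOY_HANDLED_IN_KERNEL : TOY_EXIT_TO_USERSPACE;
}

/* vcpu-1: take the early-out if the flag is set, then touch the distributor. */
static void toy_vcpu1(struct toy_dist *d, struct toy_vcpu1_result *res)
{
	if (!d->ready) {
		res->entered_guest = false;	/* would go and wait on the lock */
		return;
	}
	res->entered_guest = true;
	res->exit = toy_guest_mmio(d);
}

/* Ordering from the diagram: flag first, registration second. */
static void toy_map_ready_first(struct toy_dist *d, toy_window_fn window,
				struct toy_vcpu1_result *res)
{
	d->ready = true;
	if (window)
		window(d, res);		/* the racy window */
	d->iodev_registered = true;
}

/* Assumed alternative: register first, publish the flag last. */
static void toy_map_register_first(struct toy_dist *d, toy_window_fn window,
				   struct toy_vcpu1_result *res)
{
	d->iodev_registered = true;
	if (window)
		window(d, res);		/* same window, now harmless */
	d->ready = true;
}
```

With toy_vcpu1 as the hook, the first variant lets vcpu-1 enter the
guest and take a userspace exit, while the second keeps it out of the
guest until both steps are done.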

> 
> > > If memory serves, kvm_vgic_map_resources() used to do all of this behind
> > > the config_lock to cure the race, but that wound up inverting lock
> > > ordering on srcu.
> > 
> > Probably something like that. We also used to hold the kvm lock, which
> > made everything much simpler, but awfully wrong.
> > 
> > > Note to self: Impose strict ordering on GIC initialization v. vCPU
> > > creation if/when we get a new flavor of irqchip.
> > 
> > One of the things we should have done when introducing GICv3 is to
> > impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is
> > final. I remember some push-back on the QEMU side of things, as they
> > like to decouple things, but this has proved to be a nightmare.
> 
> Pushing more of the initialization complexity into userspace feels like
> the right thing. Since we clearly have no idea what we're doing :)

KVM APIv2?

> 
> > > The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its
> > > callees are allowed to destroy VM-scoped structures in error handling.
> > 
> > I think this is symptomatic of a more general issue: we perform VM-wide
> > configuration in the context of a vcpu. We have tons of this stuff to
> > paper over the lack of a "this VM is fully configured" barrier.
> > 
> > I wonder whether we could sidestep things by punting the finalisation
> > of the VM to a different context (workqueue?)  and simply return
> > -EAGAIN or -EINTR to userspace while we're processing it. That doesn't
> > solve the "I'm missing parts of the address map and I'm going to die"
> > part though.
> 
> Throwing it back at userspace would be nice, but unfortunately for ABI I
> think we need to block/spin vCPUs in the kernel til the VM is in fully
> working condition. A fragile userspace could explode for a 'spurious'
> EAGAIN/EINTR where there wasn't one before.

EINTR needs to be handled already, as this is how you report
preemption by a signal. But yeah, overall, I'm not enthralled with
much so far...
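
As a rough sketch of what a tolerant userspace run loop might look like
(hypothetical names throughout, with a function pointer standing in for
the real ioctl(vcpu_fd, KVM_RUN, 0) call so the example runs without
/dev/kvm): retry on EINTR, and in this sketch also on EAGAIN, which is
exactly the part an existing VMM may not be doing today. The retry cap
is an arbitrary choice for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for ioctl(vcpu_fd, KVM_RUN, 0): 0 on success, -1 with errno set. */
typedef int (*toy_run_fn)(void *ctx);

/*
 * Hypothetical run-loop helper: re-enter on EINTR/EAGAIN up to max_retries
 * times, return 0 on success or a negative errno value otherwise.
 */
static int toy_run_vcpu(toy_run_fn run, void *ctx, int max_retries)
{
	assert(run != NULL);

	for (int i = 0; ; i++) {
		int ret = run(ctx);
		int err = errno;	/* save before anything can clobber it */

		if (ret == 0)
			return 0;
		if (err != EINTR && err != EAGAIN)
			return -err;	/* a real error, give up */
		if (i >= max_retries)
			return -err;	/* still not runnable, give up */
		/* a real VMM would process pending signals/requests here */
	}
}

/* Fake "KVM_RUN" for demonstration: fails fail_count times, then succeeds. */
struct toy_fake {
	int fail_count;
	int err;
	int calls;
};

static int toy_fake_run(void *ctx)
{
	struct toy_fake *f = ctx;

	f->calls++;
	if (f->fail_count > 0) {
		f->fail_count--;
		errno = f->err;
		return -1;
	}
	return 0;
}
```

A loop that only special-cases EINTR would treat the EAGAIN case as
fatal, which is the fragility being discussed above.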

	M.

-- 
Without deviation from the norm, progress is not possible.