[PATCH] KVM: arm64: Don't eagerly teardown the vgic on init error

Thu Oct 10 01:47:04 PDT 2024

On Thu, Oct 10, 2024 at 08:54:43AM +0100, Marc Zyngier wrote:
> On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton <oliver.upton at linux.dev> wrote:
> > Then if we can't register the MMIO region for the distributor
> > everything comes crashing down and a vCPU has made it into the KVM_RUN
> > loop w/ the VGIC-shaped rug pulled out from under it. There's definitely
> > another functional bug here where a vCPU's attempts to poke the
> > distributor wind up reaching userspace as MMIO exits. But we can worry
> > about that another day.
> 
> I don't think that one is that bad. Userspace got us here, and they
> now see an MMIO exit for something that they are not prepared to handle.
> Suck it up and die (on a black size M t-shirt, please).

LOL, I'll remember that.

The situation I have in mind is a bit harder to blame on userspace,
though. Supposing that the whole VM was set up correctly, multiple vCPUs
entering KVM_RUN concurrently could cause this race and have 'unexpected'
MMIO exits go out to userspace.

	vcpu-0				vcpu-1
	======				======
	kvm_vgic_map_resources()
	  dist->ready = true
	  mutex_unlock(config_lock)
	  				kvm_vgic_map_resources()
					  if (vgic_ready())
					    return 0

					< enter guest >
					writel(0, GICD_CTLR)

					< data abort >
					kvm_io_bus_write(...)	<= No GICD, out to userspace

	  vgic_register_dist_iodev()

A small but stupid window to race with.
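
For anyone who wants to poke at the ordering outside the kernel, here is a
rough single-threaded replay of the diagram above. It is only a sketch:
struct toy_dist, toy_guest_gicd_write() and toy_replay() are made-up names,
not kernel code, and the register_first knob is a hypothetical reordering
for comparison, not what the patch does. Locking is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the vgic distributor state; not the kernel's structures. */
struct toy_dist {
	bool ready;		/* models dist->ready */
	bool iodev_registered;	/* models the GICD device being on the KVM IO bus */
};

enum toy_exit {
	TOY_HANDLED_IN_KERNEL,	/* distributor emulation claimed the access */
	TOY_EXIT_TO_USERSPACE,	/* nothing on the bus: MMIO exit to userspace */
};

/* vcpu-1: a guest write to GICD_CTLR after map_resources returned 0. */
static enum toy_exit toy_guest_gicd_write(const struct toy_dist *d)
{
	assert(d->ready);	/* vcpu-1 only enters the guest once it has seen ready */
	return d->iodev_registered ? TOY_HANDLED_IN_KERNEL : TOY_EXIT_TO_USERSPACE;
}

/*
 * Replay the interleaving from the diagram on a single thread.
 * register_first is a hypothetical knob (an assumption, not the patch):
 * it moves the iodev registration ahead of publishing ready.
 */
static enum toy_exit toy_replay(bool register_first)
{
	struct toy_dist d = { .ready = false, .iodev_registered = false };
	enum toy_exit seen;

	if (register_first)
		d.iodev_registered = true;	/* vcpu-0: register the distributor early */
	d.ready = true;				/* vcpu-0: dist->ready = true, drop the lock */

	seen = toy_guest_gicd_write(&d);	/* vcpu-1: vgic_ready(), enter guest, write */

	d.iodev_registered = true;		/* vcpu-0: vgic_register_dist_iodev() */
	return seen;
}
```

toy_replay(false) follows the diagram and ends with the write going out to
userspace; toy_replay(true) shows the same write being claimed once the
device is on the bus before ready is visible.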

> > If memory serves, kvm_vgic_map_resources() used to do all of this behind
> > the config_lock to cure the race, but that wound up inverting lock
> > ordering on srcu.
> 
> Probably something like that. We also used to hold the kvm lock, which
> made everything much simpler, but awfully wrong.
> 
> > Note to self: Impose strict ordering on GIC initialization v. vCPU
> > creation if/when we get a new flavor of irqchip.
> 
> One of the things we should have done when introducing GICv3 is to
> impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is
> final. I remember some push-back on the QEMU side of things, as they
> like to decouple things, but this has proved to be a nightmare.

Pushing more of the initialization complexity into userspace feels like
the right thing. Since we clearly have no idea what we're doing :)

> > The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its
> > callees are allowed to destroy VM-scoped structures in error handling.
> 
> I think this is symptomatic of a more general issue: we perform VM-wide
> configuration in the context of a vcpu. We have tons of this stuff to
> paper over the lack of a "this VM is fully configured" barrier.
> 
> I wonder whether we could sidestep things by punting the finalisation
> of the VM to a different context (workqueue?)  and simply return
> -EAGAIN or -EINTR to userspace while we're processing it. That doesn't
> solve the "I'm missing parts of the address map and I'm going to die"
> part though.

Throwing it back at userspace would be nice, but unfortunately for ABI
reasons I think we need to block/spin vCPUs in the kernel until the VM is
in fully working condition. A fragile userspace could explode for a
'spurious' EAGAIN/EINTR where there wasn't one before.
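
To make the "block in the kernel" option a bit more concrete, here is a
userspace toy using pthreads. Everything in it (struct toy_vm,
toy_vcpu_wait_for_vm(), toy_vm_finalise(), and so on) is a hypothetical
name for illustration and says nothing about how KVM would actually
implement this. The waiting vCPU never sees a retry-style error; it only
ever sees the final result of finalisation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a "VM fully configured" barrier; all names are hypothetical. */
struct toy_vm {
	pthread_mutex_t lock;
	pthread_cond_t finalised_cv;
	bool finalised;
	int err;		/* result of finalisation, 0 on success */
};

#define TOY_VM_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0 }

/* vCPU run path: sleep until finalisation is done, then report its result. */
static int toy_vcpu_wait_for_vm(struct toy_vm *vm)
{
	int err;

	pthread_mutex_lock(&vm->lock);
	while (!vm->finalised)
		pthread_cond_wait(&vm->finalised_cv, &vm->lock);
	err = vm->err;
	pthread_mutex_unlock(&vm->lock);
	return err;
}

/* Called once by whichever context finishes (or fails) VM finalisation. */
static void toy_vm_finalise(struct toy_vm *vm, int err)
{
	pthread_mutex_lock(&vm->lock);
	vm->err = err;
	vm->finalised = true;
	pthread_cond_broadcast(&vm->finalised_cv);	/* wake every waiting vCPU */
	pthread_mutex_unlock(&vm->lock);
}

static void *toy_vcpu_thread(void *arg)
{
	return (void *)(intptr_t)toy_vcpu_wait_for_vm(arg);
}

/* Run one waiting vCPU against a finaliser; return what the vCPU observed. */
static int toy_run_one_vcpu(int finalise_err)
{
	struct toy_vm vm = TOY_VM_INIT;
	pthread_t t;
	void *ret;
	int rc;

	if (pthread_create(&t, NULL, toy_vcpu_thread, &vm))
		return -1;
	toy_vm_finalise(&vm, finalise_err);
	rc = pthread_join(t, &ret);
	assert(rc == 0);
	(void)rc;
	return (int)(intptr_t)ret;
}
```

Whether the vCPU thread reaches the wait before or after finalisation, the
result is the same, which is the property a fragile userspace depends on.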

-- 
Thanks,
Oliver