[PATCH] KVM: arm64: Don't eagerly teardown the vgic on init error
Oliver Upton
oliver.upton at linux.dev
Thu Oct 10 01:47:04 PDT 2024
On Thu, Oct 10, 2024 at 08:54:43AM +0100, Marc Zyngier wrote:
> On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton <oliver.upton at linux.dev> wrote:
> > Then if we can't register the MMIO region for the distributor
> > everything comes crashing down and a vCPU has made it into the KVM_RUN
> > loop w/ the VGIC-shaped rug pulled out from under it. There's definitely
> > another functional bug here where a vCPU's attempts to poke the
> > distributor wind up reaching userspace as MMIO exits. But we can worry
> > about that another day.
>
> I don't think that one is that bad. Userspace got us here, and they
> now see an MMIO exit for something that they are not prepared to handle.
> Suck it up and die (on a black size M t-shirt, please).

LOL, I'll remember that.

The situation I have in mind is a bit harder to blame on userspace,
though. Supposing that the whole VM was set up correctly, multiple vCPUs
entering KVM_RUN concurrently could cause this race and have 'unexpected'
MMIO exits go out to userspace.

  vcpu-0                          vcpu-1
  ======                          ======
  kvm_vgic_map_resources()
    dist->ready = true
    mutex_unlock(config_lock)
                                  kvm_vgic_map_resources()
                                    if (vgic_ready())
                                      return 0

                                  < enter guest >
                                  typer = writel(0, GICD_CTLR)
                                  < data abort >
                                  kvm_io_bus_write(...)  <= No GICD, out to userspace
    vgic_register_dist_iodev()

A small but stupid window to race with.
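
To make the window concrete, here is a deterministic toy replay of that interleaving in plain userspace C. Everything prefixed toy_ is a hypothetical stand-in invented for illustration, not a KVM structure or function. The second helper shows one possible alternative ordering (register before publishing ready); it is only a sketch of the idea, not a claim about what the fix looks like.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in KVM. */
struct toy_dist {
	bool ready;            /* stands in for dist->ready */
	bool iodev_registered; /* stands in for the GICD MMIO device being on the bus */
};

enum toy_exit {
	TOY_HANDLED_IN_KERNEL, /* access reached the in-kernel distributor */
	TOY_EXIT_TO_USERSPACE, /* no GICD on the bus, MMIO exit goes to userspace */
	TOY_NOT_ENTERED,       /* vCPU did not take the "already ready" fast path */
};

/* A guest access to the distributor is handled in-kernel only if the
 * device has been registered by the time the access happens. */
static enum toy_exit toy_guest_gicd_access(const struct toy_dist *d)
{
	return d->iodev_registered ? TOY_HANDLED_IN_KERNEL : TOY_EXIT_TO_USERSPACE;
}

/* vcpu-1: if the vgic looks ready, return early, enter the guest and
 * touch GICD_CTLR. Otherwise it stays out of the guest. */
static enum toy_exit toy_vcpu1_run(const struct toy_dist *d)
{
	if (!d->ready)
		return TOY_NOT_ENTERED;
	return toy_guest_gicd_access(d);
}

/* Replays the diagram: ready is published first, vcpu-1 runs in the
 * window, and the distributor device is registered too late. */
static enum toy_exit toy_race_ready_first(void)
{
	struct toy_dist d = { false, false };
	enum toy_exit seen;

	d.ready = true;            /* vcpu-0: dist->ready = true, drop the lock */
	seen = toy_vcpu1_run(&d);  /* vcpu-1 slips into the window */
	d.iodev_registered = true; /* vcpu-0: register the distributor */
	return seen;
}

/* Hypothetical alternative ordering: register first, publish ready last.
 * vcpu-1 arriving in the same window never sees ready without a GICD. */
static enum toy_exit toy_race_register_first(void)
{
	struct toy_dist d = { false, false };
	enum toy_exit seen;

	d.iodev_registered = true;
	seen = toy_vcpu1_run(&d);
	d.ready = true;
	return seen;
}
```

In this model toy_race_ready_first() reports the access going out to userspace, while toy_race_register_first() keeps vcpu-1 off the fast path for the same window. The real code also has to deal with locking and with unwinding on registration failure, which the toy ignores entirely.
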
> > If memory serves, kvm_vgic_map_resources() used to do all of this behind
> > the config_lock to cure the race, but that wound up inverting lock
> > ordering on srcu.
>
> Probably something like that. We also used to hold the kvm lock, which
> made everything much simpler, but awfully wrong.
>
> > Note to self: Impose strict ordering on GIC initialization v. vCPU
> > creation if/when we get a new flavor of irqchip.
>
> One of the things we should have done when introducing GICv3 is to
> impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is
> final. I remember some push-back on the QEMU side of things, as they
> like to decouple things, but this has proved to be a nightmare.

Pushing more of the initialization complexity into userspace feels like
the right thing. Since we clearly have no idea what we're doing :)

> > The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its
> > callees are allowed to destroy VM-scoped structures in error handling.
>
> I think this is symptomatic of a more general issue: we perform VM-wide
> configuration in the context of a vcpu. We have tons of this stuff to
> paper over the lack of a "this VM is fully configured" barrier.
>
> I wonder whether we could sidestep things by punting the finalisation
> of the VM to a different context (workqueue?) and simply return
> -EAGAIN or -EINTR to userspace while we're processing it. That doesn't
> solve the "I'm missing parts of the address map and I'm going to die"
> part though.

Throwing it back at userspace would be nice, but unfortunately for ABI
reasons I think we need to block/spin vCPUs in the kernel until the VM is
in fully working condition. A fragile userspace could explode on a
'spurious' EAGAIN/EINTR where there wasn't one before.

--
Thanks,
Oliver