[PATCH 1/4] KVM: arm64: vgic: Fix a circular locking issue

Wed Jun 7 01:37:08 PDT 2023

On Wed, 07 Jun 2023 06:23:24 +0100,
Oliver Upton <oliver.upton at linux.dev> wrote:
> 
> Nathan,
> 
> First and foremost, thanks for testing this.
> 
> On Tue, Jun 06, 2023 at 03:15:25PM -0700, Nathan Chancellor wrote:
> > My apologies if this has been addressed or reported somewhere, I did a
> > search of lore.kernel.org and browsed the kvmarm archives and did not
> > see anything.
> 
> This is news to me, but even if it had already been reported there's
> nothing wrong with bumping the issue. Makes it hard for us to bury our
> heads in the sand :)

AFAICT, this is the very first report of this problem.

> 
> > After this change landed in 6.4-rc5 as commit 59112e9c390b
> > ("KVM: arm64: vgic: Fix a circular locking issue"), my QEMU Fedora VM on
> > my SolidRun Honeycomb fails to get to GRUB.
> 
> [...]
> 
> > I built a kernel with CONFIG_PROVE_LOCKING=y but I do not see any splats
> > while this is occurring. Additionally, neither my Raspberry Pi 4 or my
> > Ampere Altra system have any issues, so it is possible this could be a
> > platform specific problem. I am more than happy to provide any
> > additional information and test kernels and patches to help get to the
> > bottom of this. My kernel configuration is attached.
> 
> I was unable to reproduce the issues you're seeing on 6.4-rc5, but I
> don't have any different machines from you available atm. Based on
> your description it sounds like your VM was able to do _something_
> since it sounds like a few escape codes got out over serial...
> I'm wondering if you're getting wedged somewhere on a VGIC MMIO access.
> 
> We don't have a precise tracepoint for VGIC accesses, but kvm:kvm_mmio
> should do the trick. So, given that you're the lucky winner at
> reproducing this bug right now, do you mind collecting a dump from that
> tracepoint and sharing the access that happens before your VM gets
> wedged?
> 
> Curious if Marc has any additional insight, since (unsurprisingly) he
> has a lot more experience in dealing with the GIC than I. In the
> meantime I'll stare at the locking flows and see if anything stands
> out.

RPI4 is GICv2 nVHE, the NXP machine is GICv3 nVHE, and the Altra is
GICv3 VHE. Not sure this is relevant here, but that's one data point.

Having been able to start the guest means that we should have fully
initialised the GIC. So a lockup is likely to be an interaction with the
GIC emulation itself, either because we failed to release a lock
during initialisation, or due to some logic error in the GIC emulation
(which is not necessarily MMIO...).

I've just given 6.4-rc5 a go on my Synquacer, which is the closest
thing I have to Nathan's NXP box, and I can't spot anything odd.

It would also help to get access to the EDK2 build. It wouldn't be the
first time that a change in KVM breaks some EDK2 behaviour.

Finally, on top of the traces that Oliver asked for above, looking at
where the QEMU vcpu threads are would be interesting (I assume they'd
be sleeping in the kernel).
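
One way to collect that, as a rough sketch rather than a recommended
tool: walk /proc/<pid>/task/ for the QEMU process and print each
thread's comm, wchan and kernel stack. The helper name thread_states()
and the proc_root parameter are made up for illustration. Reading the
stack file normally needs root and a kernel built with
CONFIG_STACKTRACE, and vcpu threads only carry names such as
"CPU 0/KVM" if QEMU was asked to name its threads, so treat those as
assumptions.

```python
import os
import sys


def thread_states(pid, proc_root="/proc"):
    """Return one dict per thread of `pid` with its comm, wchan and stack.

    `proc_root` is a hypothetical knob so the function can be pointed
    at a fake directory tree; on a real system leave it at /proc.
    """
    task_dir = os.path.join(proc_root, str(pid), "task")
    states = []
    for tid in sorted(os.listdir(task_dir), key=int):

        def read(name, tid=tid):
            # Files may be missing or unreadable (no root, no
            # CONFIG_STACKTRACE); report an empty string in that case.
            try:
                with open(os.path.join(task_dir, tid, name)) as f:
                    return f.read().strip()
            except OSError:
                return ""

        states.append(
            {
                "tid": int(tid),
                "comm": read("comm"),
                "wchan": read("wchan"),
                "stack": read("stack"),
            }
        )
    return states


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 thread_states.py <qemu-pid>
    for t in thread_states(int(sys.argv[1])):
        print(f"{t['tid']} {t['comm']!r} wchan={t['wchan'] or '?'}")
        for line in t["stack"].splitlines():
            print(f"    {line}")
```

Running it against the QEMU pid while the guest is wedged should show
whether the vcpu threads are blocked in the kernel and, if the stack
file is readable, roughly where.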

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.