[PATCH 1/4] KVM: arm64: vgic: Fix a circular locking issue
Jean-Philippe Brucker
jean-philippe at linaro.org
Wed Jun 7 06:28:19 PDT 2023
On Wed, Jun 07, 2023 at 09:37:08AM +0100, Marc Zyngier wrote:
> > > After this change landed in 6.4-rc5 as commit 59112e9c390b
> > > ("KVM: arm64: vgic: Fix a circular locking issue"), my QEMU Fedora VM on
> > > my SolidRun Honeycomb fails to get to GRUB.
> >
> > [...]
> >
> > > I built a kernel with CONFIG_PROVE_LOCKING=y but I do not see any splats
> > > while this is occurring. Additionally, neither my Raspberry Pi 4 nor my
> > > Ampere Altra system has any issues, so it is possible this could be a
> > > platform specific problem. I am more than happy to provide any
> > > additional information and test kernels and patches to help get to the
> > > bottom of this. My kernel configuration is attached.
> >
> > I was unable to reproduce the issues you're seeing on 6.4-rc5, but I
> > don't have any different machines from you available atm. Based on
> > your description it sounds like your VM was able to do _something_
> > since it sounds like a few escape codes got out over serial...
> > I'm wondering if you're getting wedged somewhere on a VGIC MMIO access.
> >
> > We don't have a precise tracepoint for VGIC accesses, but kvm:kvm_mmio
> > should do the trick. So, given that you're the lucky winner at
> > reproducing this bug right now, do you mind collecting a dump from that
> > tracepoint and sharing the access that happens before your VM gets
> > wedged?
> >
> > Curious if Marc has any additional insight, since (unsurprisingly) he
> > has a lot more experience in dealing with the GIC than I. In the
> > meantime I'll stare at the locking flows and see if anything stands
> > out.
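
For the kvm:kvm_mmio dump, here is a rough sketch of a filter that narrows
the trace down to one guest-physical range (for instance the GIC frames)
once the event is enabled through tracefs. The line format matched by the
regex, the tracefs path in the note below, and the names parse_kvm_mmio
and filter_stream are assumptions of mine, not something taken from this
thread; check the event's format file on the running kernel first.

```python
import re

# Assumed text form of a kvm_mmio trace line, for example:
#   "... kvm_mmio: mmio write len 4 gpa 0x8000000 val 0x1"
MMIO_RE = re.compile(
    r"kvm_mmio: mmio (?P<kind>\S+) len (?P<len>\d+) "
    r"gpa 0x(?P<gpa>[0-9a-fA-F]+) val 0x(?P<val>[0-9a-fA-F]+)"
)


def parse_kvm_mmio(line):
    """Return the fields of one kvm_mmio line as a dict, or None if no match."""
    m = MMIO_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),
        "len": int(m.group("len")),
        "gpa": int(m.group("gpa"), 16),
        "val": int(m.group("val"), 16),
    }


def filter_stream(lines, base, size):
    """Yield parsed events whose address falls inside [base, base + size)."""
    for line in lines:
        ev = parse_kvm_mmio(line)
        if ev is not None and base <= ev["gpa"] < base + size:
            yield ev
```

Something like filter_stream(open("/sys/kernel/tracing/trace_pipe"), BASE,
SIZE) would then print only the accesses in that window, where BASE and
SIZE are placeholders for wherever the VMM maps the GIC. The last event
printed before the guest stops making progress is the interesting one.
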
>
> RPI4 is GICv2 nVHE, the NXP machine is GICv3 nVHE, and the Altra is
> GICv3 VHE. Not sure this is relevant here, but that's one data point.
>
> Having been able to start the guest means that we should have fully
> initialised the GIC. So a lockup is likely to be an interaction with the
> GIC emulation itself, either because we failed to release a lock
> during initialisation, or due to some logic error in the GIC emulation
> (which is not necessarily MMIO...).
>
> I've just given 6.4-rc5 a go on my Synquacer, which is the closest
> thing I have to Nathan's NXP box, and I can't spot anything odd.
>
> It would also help to get access to the EDK2 build. It wouldn't be the
> first time that a change in KVM breaks some EDK2 behaviour.

I found a build here:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=2204660
  edk2-aarch64-20230301gitf80f052277c8-31.fc39.noarch.rpm
  usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw

I haven't managed to reproduce the issue yet, but I can only test with QEMU
emulating the cortex-a72 and GICv3 at the moment, and I still need to
reproduce the VMM command-line exactly. I think it would be helpful to get
the exact grub image as well. Right now I'm using:

  Fedora-Server-KVM-38-1.6.aarch64.qcow2

Thanks,
Jean
>
> Finally, on top of the traces that Oliver asked above, looking at
> where the QEMU vcpu threads are would be interesting (I assume they'd
> be sleeping in the kernel).
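
To see where the vcpu threads are, a minimal sketch that lists each
thread of the QEMU process with its name and kernel wait channel, read
from /proc/<pid>/task/<tid>/comm and wchan. The function name
thread_states and its proc_root parameter are my own, and it assumes the
vcpu threads can be told apart by name, which depends on how QEMU was
started. Reading /proc/<pid>/task/<tid>/stack as root gives the full
kernel stack when the kernel exposes it.

```python
import os


def _read(path):
    """Return the stripped contents of a procfs file, or "?" if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "?"


def thread_states(pid, proc_root="/proc"):
    """Return a list of (tid, comm, wchan) tuples, one per thread of pid."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    states = []
    for tid in sorted(os.listdir(task_dir), key=int):
        base = os.path.join(task_dir, tid)
        states.append((
            int(tid),
            _read(os.path.join(base, "comm")),
            _read(os.path.join(base, "wchan")),
        ))
    return states
```

Calling thread_states(QEMU_PID) a few times while the guest is wedged
should show whether the vcpu threads stay parked on the same symbol.
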
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.