Question on guest enable msi fail when using GICv4/4.1
Marc Zyngier
maz at kernel.org
Fri May 7 10:36:23 PDT 2021
On Fri, 07 May 2021 12:02:57 +0100,
Marc Zyngier <maz at kernel.org> wrote:
>
> On Fri, 07 May 2021 10:58:23 +0100,
> Shaokun Zhang <zhangshaokun at hisilicon.com> wrote:
> >
> > Hi Marc,
> >
> > Thanks for your quick reply.
> >
> > On 2021/5/7 17:03, Marc Zyngier wrote:
> > > On Fri, 07 May 2021 06:57:04 +0100,
> > > Shaokun Zhang <zhangshaokun at hisilicon.com> wrote:
> > >>
> > >> [This letter comes from Nianyao Tang]
> > >>
> > >> Hi,
> > >>
> > >> Using GICv4/4.1 and the MSI capability, when the guest VF driver
> > >> requires 3 vectors and enables MSI, the guest gets stuck.
> > >
> > > Stuck how?
> >
> > The guest serial console no longer responds and the guest network goes down.
> >
> > >
> > >> Qemu gets number of interrupts from Multiple Message Capable field
> > >> set by guest. This field is aligned to a power of 2 (if a function
> > >> requires 3 vectors, it initializes it to 2).
> > >
> > > So I guess this is a MultiMSI device with 4 vectors, right?
> > >
> >
> > Yes, it can support a maximum of 32 MSI interrupts, and the VF driver only uses 3.
> >
> > >> However, the guest driver only sends 3 MAPI commands to the vITS,
> > >> and only 3 ITE entries are recorded in the host. VFIO initializes
> > >> the MSI interrupts using the count of 4 provided by QEMU. When it
> > >> reaches the 4th MSI, which has no ITE in the vITS, __connect
> > >> between producer and consumer fails in irq_bypass_register_producer
> > >> because find_ite fails, and the guest is not resumed.
> > >
> > > Let me rephrase this to check that I understand it:
> > > - The device has 4 vectors
> > > - The guest only creates mappings for 3 of them
> > > - VFIO calls kvm_vgic_v4_set_forwarding() for each vector
> > > - KVM doesn't have a mapping for the 4th vector and returns an error
> > > - VFIO disables this 4th vector
> > >
> > > Is that correct? If yes, I don't understand why that impacts the guest
> > > at all. From what I can see, vfio_msi_set_vector_signal() just prints
> > > a message on the console and carries on.
> > >
> >
> > function calls:
> > --> vfio_msi_set_vector_signal
> > --> irq_bypass_register_producer
> > -->__connect
> >
> > In __connect, add_producer ends up calling kvm_vgic_v4_set_forwarding,
> > which fails to find the 4th mapping. When add_producer fails, __connect
> > does not call cons->start, so kvm_arch_irq_bypass_start and then
> > kvm_arm_resume_guest are never called.
>
> [+Eric, who wrote the irq_bypass infrastructure.]
>
> Ah, so the guest is actually paused, not in a livelock situation
> (which is how I interpreted "stuck").
>
> I think we should handle this case gracefully, as there should be no
> expectation that the guest will be using this interrupt. Given that
> VFIO seems to be pretty unfazed when a producer fails, I'm tempted to
> do the same thing and restart the guest.
>
> Also, __disconnect doesn't care about errors, so why should __connect
> have this odd behaviour?
>
> Can you please try this? It is completely untested (and I think the
> del_consumer call is odd, which is why I've also dropped it).
>
> Eric, what do you think?

Adding Zhu, Jason, MST to the party. It all seems to be caused by this
commit:

commit a979a6aa009f3c99689432e0cdb5402a4463fb88
Author: Zhu Lingshan <lingshan.zhu at intel.com>
Date:   Fri Jul 31 14:55:33 2020 +0800

    irqbypass: do not start cons/prod when failed connect

    If failed to connect, there is no need to start consumer nor
    producer.

    Signed-off-by: Zhu Lingshan <lingshan.zhu at intel.com>
    Suggested-by: Jason Wang <jasowang at redhat.com>
    Link: https://lore.kernel.org/r/20200731065533.4144-7-lingshan.zhu@intel.com
    Signed-off-by: Michael S. Tsirkin <mst at redhat.com>

Zhu, I'd really like to understand why you think it is OK not to
restart consumers and producers when a connection has failed to be
established between the two?

In the case of KVM/arm64, this results in the guest being forever
suspended and never resumed. That's obviously not an acceptable
regression, as there are a number of benign reasons for a connect to
fail.
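
To make the failure mode concrete, here is a toy user-space model of the
two behaviours discussed above. It is only a sketch and not the irqbypass
code: every toy_* name is made up for illustration, the guest is reduced
to a single "paused" flag, and it assumes (as on KVM/arm64) that the
consumer's stop callback pauses the guest and its start callback resumes
it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a guest: only "paused or not" is modelled. */
struct toy_guest {
	bool paused;
};

/* Models the consumer stop callback: assumed to pause the guest. */
void toy_consumer_stop(struct toy_guest *g)
{
	g->paused = true;
}

/* Models the consumer start callback: assumed to resume the guest. */
void toy_consumer_start(struct toy_guest *g)
{
	g->paused = false;
}

/*
 * Models add_producer: succeeds only if the guest created a mapping
 * for this (0-based) vector, otherwise returns a negative error.
 */
int toy_add_producer(int vector, int mapped_vectors)
{
	return vector < mapped_vectors ? 0 : -1;
}

/* Behaviour described in this thread: start is skipped on failure. */
int toy_connect_skip_start(struct toy_guest *g, int vector, int mapped)
{
	int ret;

	assert(g != NULL);
	toy_consumer_stop(g);
	ret = toy_add_producer(vector, mapped);
	if (!ret)
		toy_consumer_start(g);	/* never reached on failure */
	return ret;
}

/* Alternative: always restart, whether or not the connection was made. */
int toy_connect_always_start(struct toy_guest *g, int vector, int mapped)
{
	int ret;

	assert(g != NULL);
	toy_consumer_stop(g);
	ret = toy_add_producer(vector, mapped);
	toy_consumer_start(g);
	return ret;
}
```

With 3 mapped vectors, connecting vector index 3 (the 4th one) fails in
both variants, but only the first one leaves the toy guest paused.
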
Thanks,
M.
--
Without deviation from the norm, progress is not possible.