Question on guest enable msi fail when using GICv4/4.1

Zhu, Lingshan lingshan.zhu at intel.com
Fri May 7 23:56:58 PDT 2021



On 5/8/2021 9:51 AM, Jason Wang wrote:
>
> On 2021/5/8 1:36 AM, Marc Zyngier wrote:
>> On Fri, 07 May 2021 12:02:57 +0100,
>> Marc Zyngier <maz at kernel.org> wrote:
>>> On Fri, 07 May 2021 10:58:23 +0100,
>>> Shaokun Zhang <zhangshaokun at hisilicon.com> wrote:
>>>> Hi Marc,
>>>>
>>>> Thanks for your quick reply.
>>>>
>>>> On 2021/5/7 17:03, Marc Zyngier wrote:
>>>>> On Fri, 07 May 2021 06:57:04 +0100,
>>>>> Shaokun Zhang <zhangshaokun at hisilicon.com> wrote:
>>>>>> [This letter comes from Nianyao Tang]
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> With GICv4/4.1 and the MSI capability, a guest VF driver that
>>>>>> requests 3 vectors and enables MSI causes the guest to get stuck.
>>>>> Stuck how?
>>>> The guest serial console no longer responds and the guest network
>>>> goes down.
>>>>
>>>>>> QEMU gets the number of interrupts from the Multiple Message Capable
>>>>>> field set by the guest. This field is rounded up to a power of 2 and
>>>>>> stored as its log2 (if a function requires 3 vectors, the field is
>>>>>> set to 2, i.e. 4 vectors).
>>>>> So I guess this is a MultiMSI device with 4 vectors, right?
>>>>>
>>>> Yes, it can support a maximum of 32 MSI interrupts, and the VF
>>>> driver only uses 3 of them.
>>>>
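The 3-versus-4 mismatch discussed above follows from the log2 encoding of
the MSI Multiple Message fields. The following is a minimal sketch of that
arithmetic only; the helper names msi_encode_vectors and msi_decode_vectors
are hypothetical and are not taken from QEMU, VFIO or the kernel.

```c
#include <assert.h>

/*
 * Hypothetical helper: return the smallest n such that (1 << n) >= nvec.
 * This is the value a log2-encoded MSI vector-count field would hold.
 */
static unsigned int msi_encode_vectors(unsigned int nvec)
{
    unsigned int log2 = 0;

    assert(nvec >= 1 && nvec <= 32);    /* MSI allows at most 32 vectors */
    while ((1u << log2) < nvec)
        log2++;
    return log2;
}

/* Hypothetical helper: vector count implied by the encoded field value. */
static unsigned int msi_decode_vectors(unsigned int log2)
{
    return 1u << log2;
}
```

With these helpers, msi_encode_vectors(3) is 2 and msi_decode_vectors(2) is
4, so a driver that wants 3 vectors is seen as a 4-vector device by anything
that only reads the encoded field.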
>>>>>> However, the guest driver only sends 3 MAPI commands to the vITS,
>>>>>> so only 3 ITE entries are recorded in the host. VFIO initializes
>>>>>> the MSI interrupts using the count of 4 provided by QEMU. When it
>>>>>> reaches the 4th MSI, which has no ITE in the vITS, the __connect
>>>>>> of producer and consumer in irq_bypass_register_producer fails
>>>>>> because find_ite fails, and the guest is not resumed.
>>>>> Let me rephrase this to check that I understand it:
>>>>> - The device has 4 vectors
>>>>> - The guest only creates mappings for 3 of them
>>>>> - VFIO calls kvm_vgic_v4_set_forwarding() for each vector
>>>>> - KVM doesn't have a mapping for the 4th vector and returns an error
>>>>> - VFIO disables this 4th vector
>>>>>
>>>>> Is that correct? If yes, I don't understand why that impacts the
>>>>> guest at all. From what I can see, vfio_msi_set_vector_signal() just
>>>>> prints a message on the console and carries on.
>>>>>
>>>> function calls:
>>>> --> vfio_msi_set_vector_signal
>>>>     --> irq_bypass_register_producer
>>>>        --> __connect
>>>>
>>>> In __connect, add_producer ends up calling kvm_vgic_v4_set_forwarding,
>>>> which fails to find the 4th mapping. When add_producer fails,
>>>> __connect does not call cons->start, which would call
>>>> kvm_arch_irq_bypass_start and then kvm_arm_resume_guest.
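The failure mode described above can be modelled in a few lines. This is a
toy sketch, not the kernel's irqbypass code: struct toy_vm and every toy_*
function are hypothetical stand-ins, assuming only what the thread states,
namely that the consumer's stop/start callbacks pause and resume the whole
guest and that add_producer fails for an unmapped vector.

```c
#include <stdbool.h>

/* Toy stand-in for a VM; not a kernel structure. */
struct toy_vm {
    bool paused;
};

/* Models cons->stop as described in the thread: halts the whole guest. */
static void toy_stop(struct toy_vm *vm)
{
    vm->paused = true;
}

/* Models cons->start as described in the thread: resumes the whole guest. */
static void toy_start(struct toy_vm *vm)
{
    vm->paused = false;
}

/* Models add_producer failing when the guest has not mapped the vector. */
static int toy_add_producer(bool has_mapping)
{
    return has_mapping ? 0 : -1;
}

/* Connect that bails out on failure: start is skipped, guest stays paused. */
static int toy_connect_early_return(struct toy_vm *vm, bool has_mapping)
{
    int ret;

    toy_stop(vm);
    ret = toy_add_producer(has_mapping);
    if (ret)
        return ret;
    toy_start(vm);
    return 0;
}

/* Connect that always restarts, whatever add_producer returned. */
static int toy_connect_always_start(struct toy_vm *vm, bool has_mapping)
{
    int ret;

    toy_stop(vm);
    ret = toy_add_producer(has_mapping);
    toy_start(vm);
    return ret;
}
```

In this model both variants report the error for the 4th vector, but only
toy_connect_always_start leaves vm->paused false afterwards. The point of
the sketch is that the stop and start calls need to be paired regardless of
whether the connection itself succeeded.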
>>> [+Eric, who wrote the irq_bypass infrastructure.]
>>>
>>> Ah, so the guest is actually paused, not in a livelock situation
>>> (which is how I interpreted "stuck").
>>>
>>> I think we should handle this case gracefully, as there should be no
>>> expectation that the guest will be using this interrupt. Given that
>>> VFIO seems to be pretty unfazed when a producer fails, I'm tempted to
>>> do the same thing and restart the guest.
>>>
>>> Also, __disconnect doesn't care about errors, so why should __connect
>>> have this odd behaviour?
>>>
>>> Can you please try this? It is completely untested (and I think the
>>> del_consumer call is odd, which is why I've also dropped it).
>>>
>>> Eric, what do you think?
>> Adding Zhu, Jason, MST to the party. It all seems to be caused by this
>> commit:
>>
>> commit a979a6aa009f3c99689432e0cdb5402a4463fb88
>> Author: Zhu Lingshan <lingshan.zhu at intel.com>
>> Date:   Fri Jul 31 14:55:33 2020 +0800
>>
>>      irqbypass: do not start cons/prod when failed connect
>>
>>      If failed to connect, there is no need to start consumer nor
>>      producer.
>>
>>      Signed-off-by: Zhu Lingshan <lingshan.zhu at intel.com>
>>      Suggested-by: Jason Wang <jasowang at redhat.com>
>>      Link: https://lore.kernel.org/r/20200731065533.4144-7-lingshan.zhu@intel.com
>>      Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
>>
>>
>> Zhu, I'd really like to understand why you think it is OK not to
>> restart consumer and producers when a connection has failed to be
>> established between the two?
>
>
> My bad, I didn't check the ARM code, but it's not easy to infer that
> cons->start/stop is not a per-consumer operation but a global one,
> like VM halting/resuming.
Hi Marc,

I will send out a patch to revert this commit as Jason suggested.

Thanks
>
>
>>
>> In the case of KVM/arm64, this results in the guest being forever
>> suspended and never resumed. That's obviously not an acceptable
>> regression, as there are a number of benign reasons for a connect to
>> fail.
>
>
> Let's revert this commit.
>
> Thanks
>
>
>>
>> Thanks,
>>
>>     M.
>>
>




More information about the linux-arm-kernel mailing list