KVM: Nested VGIC emulation leads to infinite IRQ exceptions

Volodymyr Babchuk Volodymyr_Babchuk at epam.com
Thu Oct 2 08:08:09 PDT 2025


Hi Marc,

Marc Zyngier <maz at kernel.org> writes:

> On Thu, 02 Oct 2025 13:29:42 +0100,
> Volodymyr Babchuk <Volodymyr_Babchuk at epam.com> wrote:

[...]

>>  qemu-system-aar-3378    [085] d....   246.770720: vgic_populate_lr: VCPU 1 lr 0 = 90a000000000004f
>>  qemu-system-aar-3378    [085] d....   246.770720: vgic_populate_lr: VCPU 1 lr 1 = 90a000000000004e
>>  qemu-system-aar-3378    [085] d....   246.770720: vgic_populate_lr: VCPU 1 lr 2 = d0a000000000004a
>>  qemu-system-aar-3378    [085] d....   246.770720: vgic_populate_lr: VCPU 1 lr 3 = d0a000000000004b
>> 
>>  As all LR entries have the ACTIVE bit set, a read from IAR1 will of
>>  course return 1023. The problem is that Xen itself can't deactivate
>>  these 4 IRQs, as they are directed to DomU, so DomU would have to
>>  deactivate them first. But DomU can't do this as it is never executed.
>
> There is a flaw in your reasoning: if these are DomU (an L2 guest)
> interrupts, why would they impact Xen itself, which is L1? At the
> point of entering Xen, the HW LRs should only contain the virtual
> interrupts that are targeting Xen, and nothing else (the DomU
> interrupts being stored in the shadow LRs).

Agreed, they **should**. But it looks like they contain all IRQs that
target that particular vCPU. I am still studying KVM's vGIC, so I
can't say why this is happening.

Mind you, these are QEMU's IRQs, so from Xen's standpoint they are
HW interrupts and of course they target Xen. Xen injects them into a
guest by writing a vLR with the HW bit set.
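
For reference, decoding the four LR values from the trace above against
the architectural ICH_LR<n>_EL2 layout (State in [63:62], HW in bit 61,
Group in bit 60, Priority in [55:48], vINTID in [31:0]) shows all of
them in Active or Pending+Active state with the HW bit clear. A quick
throwaway decoder, just for illustration:

#include <stdint.h>
#include <stdio.h>

/* Architectural ICH_LR<n>_EL2 fields, nothing KVM-specific here */
static void decode_lr(uint64_t lr)
{
        static const char *states[] = {
                "Invalid", "Pending", "Active", "Pending+Active"
        };
        unsigned int state  = (lr >> 62) & 0x3;
        unsigned int hw     = (lr >> 61) & 0x1;
        unsigned int group  = (lr >> 60) & 0x1;
        unsigned int prio   = (lr >> 48) & 0xff;
        unsigned int vintid = lr & 0xffffffff;

        printf("%016llx: %-14s HW=%u G%u prio=0x%02x vINTID=%u\n",
               (unsigned long long)lr, states[state], hw, group, prio, vintid);
}

int main(void)
{
        /* The four LR values from the trace above */
        uint64_t lrs[] = {
                0x90a000000000004fULL, 0x90a000000000004eULL,
                0xd0a000000000004aULL, 0xd0a000000000004bULL,
        };

        for (unsigned int i = 0; i < 4; i++)
                decode_lr(lrs[i]);

        return 0;
}

For all four entries this prints Active or Pending+Active with HW=0.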

IMO, KVM should track these re-injected IRQs and remove them from Xen's
LRs. But this assumes that Xen (or any other nested hypervisor) is
well-behaved and will not try to deactivate an IRQ that it has already
injected into its own guest.

>
> I can't see so far how we'd end-up in that situation, given that we do
> a full context switch of the vgic context on each EL1/EL2 transition.
>
> Unless you are actually acknowledging the DomU interrupts in Xen and
> injecting them back into DomU? Which seems very odd as you don't have
> the HW bit set, which I'd expect if that was the case...

Isn't KVM doing the same? I mean, all HW IRQs target the hypervisor and
are then routed and re-injected into a guest. AFAIR, only LPIs can be
injected directly into a guest. And, as I said, the IRQs in question are
generated by an external QEMU, so Xen considers them HW interrupts.

>
>> I am not sure what is the correct fix, but I see two options:
>> 
>> - Prioritize timer IRQs so they always present in LRs
>> - De-prioritize ACTIVE IRQs so they are inserted into LRs last.
>> 
>> Looks like the second one is better.
>
> That's indeed something missing in KVM (I have long waited until
> someone would do it in my stead, but nobody seems to be bothered) but
> it isn't clear, from what you are describing, that this is the actual
> solution to your problem.
>

Okay, disregard my previous ideas. We can't just willy-nilly remove
ACTIVE IRQs from LRs. So we probably need some sort of heuristic to
determine whether the L1 hypervisor has re-injected an IRQ into an L2
guest. I think we can check the HW bit in the vLR to determine this.
That way we can differentiate L1- and L2-targeted IRQs during the
context switch from KVM to L1/L2 and fill the LRs accordingly.
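
Something along these lines, as a rough standalone sketch (all names
and types here are invented for illustration, this is not the actual
kvm/vgic code): before KVM populates L1's LRs, check whether the same
INTID already sits in one of the shadow LRs that L1 programmed for L2
with the HW bit set, and if so skip it instead of burning an L1 LR slot
on it:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural ICH_LR bits; everything else below is made up */
#define LR_HW_BIT          (1ULL << 61)
#define LR_PENDING_BIT     (1ULL << 62)
#define LR_PINTID(lr)      ((uint32_t)(((lr) >> 32) & 0x1fff))

/*
 * Does one of the shadow LRs that L1 wrote for its L2 guest forward
 * this INTID (HW bit set, pINTID field pointing at it)?
 */
static bool l1_forwarded_to_l2(const uint64_t *shadow_lrs,
                               unsigned int nr_lrs, uint32_t intid)
{
        for (unsigned int i = 0; i < nr_lrs; i++) {
                if ((shadow_lrs[i] & LR_HW_BIT) &&
                    LR_PINTID(shadow_lrs[i]) == intid)
                        return true;
        }
        return false;
}

int main(void)
{
        /* One hypothetical shadow LR: L1 forwarded INTID 79 to L2 with HW=1 */
        uint64_t shadow_lrs[] = {
                LR_PENDING_BIT | LR_HW_BIT | (79ULL << 32) | 79,
        };
        /* Interrupts KVM is about to present to L1 */
        uint32_t pending_for_l1[] = { 79, 27 };

        for (unsigned int i = 0; i < 2; i++) {
                uint32_t intid = pending_for_l1[i];

                if (l1_forwarded_to_l2(shadow_lrs, 1, intid))
                        printf("INTID %u: skip, L1 already injected it into L2\n",
                               intid);
                else
                        printf("INTID %u: populate an L1 LR as usual\n", intid);
        }
        return 0;
}

Where exactly such a check would hook into the shadow LR handling is of
course the open question.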

Of course, as I said, this relies on the good behavior of the L1
hypervisor, because it could try to EOI an IRQ that it has already
injected into a guest. This is not a huge deal if we are dealing with
"virtual" HW interrupts (generated by QEMU in this case), but it can be
tricky with real HW interrupts generated by a real HW device and
injected all the way down to L2.

-- 
WBR, Volodymyr

