[PATCH 16/43] KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells

Thu May 21 07:07:51 PDT 2026

On Thu, 2026-04-30 at 13:23 +0100, Marc Zyngier wrote:
> On Mon, 27 Apr 2026 17:11:30 +0100,
> Sascha Bischoff <Sascha.Bischoff at arm.com> wrote:
> > 
> > Each GICv5 VM requires a valid VM Table Entry (VMTE). The VM Table
> > itself is allocated during probe time, but a VM needs to provision
> > a
> > VMTE before it is able to properly run (PPIs will work, but nothing
> > else will - and PPIs only are not useful!).
> > 
> > The correct time for setting up the VMTE is during VM
> > initialisation. For GICv5, this is vgic_v5_init(). Each VM needs a
> > VM
> > ID - this is actually the index into the VM Table so it is how a
> > specific VMTE is selected too. As part of vgic_v5_init get a VM ID
> > via
> > vgic_v5_allocate_vm_id(), which internally uses an IDA to select an
> > unused VM ID (and hence VMTE) within the range of allowed VM IDs.
> > 
> > Once the VM ID has been allocated, the doorbell domain for the VM
> > is
> > allocated, and each of the doorbells itself is allocated and
> > assigned
> > to a vcpu.
> > 
> > Assuming everything up until this point has succeeded, initialise
> > the
> > VMTE. Internally this allocates the additional data structures
> > required by the hardware - the VM Descriptor, VPE Table, etc. This
> > VMTE is then made valid via the IRS's MMIO interface. Finally, all
> > VPEs are allocated within the VPET.
> > 
> > On teardown, this process is reversed again. The VMTE is made
> > invalid,
> > the VPEs are freed, the doorbells are released and the domain torn
> > down, and finally the VM ID is released. The latter allows the VM
> > ID
> > and VMTE to be reused for a future VM.
> > 
> > Signed-off-by: Sascha Bischoff <sascha.bischoff at arm.com>
> > ---
> >  arch/arm64/kvm/vgic/vgic-v5.c | 146 +++++++++++++++++++++++++++++-
> > ----
> >  1 file changed, 128 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/vgic/vgic-v5.c
> > b/arch/arm64/kvm/vgic/vgic-v5.c
> > index 2fc6fa4df034f..9347bc6895223 100644
> > --- a/arch/arm64/kvm/vgic/vgic-v5.c
> > +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> > @@ -518,6 +518,18 @@ static int vgic_v5_irs_vpe_cr0_update(int
> > vm_id, int vpe_id, u32 cr0)
> >   return 0;
> >  }
> >  
> > +static irqreturn_t db_handler(int irq, void *data)
> > +{
> > + struct kvm_vcpu *vcpu = data;
> > +
> > + WRITE_ONCE(vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db_fired, true);
> > +
> > + kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
> > + kvm_vcpu_kick(vcpu);
> > +
> > + return IRQ_HANDLED;
> > +}
> > +
> 
> I think it'd make more sense if the doorbell
> handling/requesting/freeing was one patch, or at least a set of
> consecutive patches in the series.
> 
> As it is now, it is very hard to keep track of things. You have part
> of it in the previous patch, the requesting and handling here, and
> probably the freeing in some other patch I haven't seen.
> 
> >  static int vgic_v5_send_command(struct kvm_vcpu *vcpu,
> >   enum gicv5_vcpu_info_cmd_type type)
> >  {
> > @@ -726,26 +738,46 @@ void vgic_v5_reset(struct kvm_vcpu *vcpu)
> >   }
> >  }
> >  
> > -int vgic_v5_init(struct kvm *kvm)
> > +int vgic_v5_map_resources(struct kvm *kvm)
> >  {
> > - struct kvm_vcpu *vcpu;
> > - unsigned long idx;
> > - int ret;
> > + if (!vgic_initialized(kvm))
> > + return -EBUSY;
> >  
> > - if (vgic_initialized(kvm))
> > - return 0;
> > + return 0;
> > +}
> 
> Pointless code movement?

Very pointless. Removed that.

> 
> > 
> > - ret = vgic_v5_create_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> > - if (ret)
> > - return ret;
> > +/*
> > + * Claim and populate a VMTE (optionally making a new L2 VMT
> > valid), create VPE
> > + * doorbells, allocate VPET and populate for each VPE. Finally, we
> > also init the
> > + * vIRS, which means allocating and making the virtual SPI IST
> > valid.
> > + *
> > + * Note: We do need to put the cart before the horse here. The VPE
> > doorbells are
> > + * our conduit for communication with the IRS, which means we need
> > to have those
> > + * before making the VMTE valid.
> > + *
> > + * On failure, we clean up in the teardown path
> > (vgic_v5_teardown()).
> > + */
> > +int vgic_v5_init(struct kvm *kvm)
> > +{
> > + int nr_vcpus, ret = 0;
> > + struct kvm_vcpu *vcpu, *vcpu0;
> > + unsigned long i;
> > + struct irq_data *d;
> > + unsigned int db_virq;
> > +
> > + nr_vcpus = atomic_read(&kvm->online_vcpus);
> > + if (nr_vcpus == 0)
> > + return -ENODEV;
> >  
> > - kvm_for_each_vcpu(idx, vcpu, kvm) {
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> >   if (vcpu_has_nv(vcpu)) {
> >   kvm_err("Nested GICv5 VMs are currently unsupported\n");
> >   return -EINVAL;
> >   }
> >   }
> >  
> > + kvm->arch.vgic.gicv5_vm.nr_vpes = nr_vcpus;
> 
> Why do we need to track the number of vcpus separately from what KVM
> already does? GICv4 does it because a lot of the state is managed by
> the irqchip driver, but that's not the case here. I hope we can come
> up with a slightly simpler model with GICv5.

Right, it isn't needed. I've removed it from here (and from the earlier
commits that introduced it/used it).

> 
> > +
> >   /* We only allow userspace to drive the SW_PPI, if it is
> > implemented. */
> >   bitmap_zero(kvm->arch.vgic.gicv5_vm.userspace_ppis,
> >       VGIC_V5_NR_PRIVATE_IRQS);
> > @@ -754,20 +786,98 @@ int vgic_v5_init(struct kvm *kvm)
> >      kvm->arch.vgic.gicv5_vm.userspace_ppis,
> >      ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
> >  
> > - return 0;
> > + ret = vgic_v5_allocate_vm_id(kvm);
> > + if (ret) {
> > + kvm_err("Maximum number of GICv5 VMs reached!\n");
> > + return ret;
> > + }
> 
> I'd rather we don't scream on the console when running out of
> VMIDs. If we're at capacity, so be it. That's not an error worth
> spamming the console over.

Alright. Have removed the printing here.

> 
> > +
> > + ret = vgic_v5_create_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> > + if (ret)
> > + return ret;
> 
> Who is freeing the VMID?

We are in the vgic_v5_teardown() path.

> 
> > +
> > + /*
> > + * Allocate VPE doorbells first - these are our conduit for
> > + * communicating with the host irqchip driver.
> > + */
> > + db_virq = irq_domain_alloc_irqs(kvm->arch.vgic.gicv5_vm.domain,
> > + nr_vcpus, NUMA_NO_NODE,
> > + &kvm->arch.vgic.gicv5_vm);
> > + if (db_virq < 0) {
> > + /* Simplify teardown by doing this early! */
> > + vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> > + return db_virq;
> > + }
> > +
> > + kvm->arch.vgic.gicv5_vm.vpe_db_base = db_virq;
> > +
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + d = irq_domain_get_irq_data(kvm->arch.vgic.gicv5_vm.domain,
> > +     db_virq + i);
> > + irq_set_status_flags(db_virq + i, IRQ_NOAUTOEN);
> > +
> > + ret = request_irq(db_virq + i, db_handler, 0, "vcpu", vcpu);
> > + if (ret)
> > + return ret;
> > +
> > + /* Stash it with the VCPU for easy retrieval */
> > + vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db = db_virq + i;
> > + }
> > +
> > + /* Populate VMTE (with VPET and VM descriptor) */
> > + ret = vgic_v5_vmte_init(kvm);
> > + if (ret)
> > + return ret;
> > +
> > + /* We pick the first vcpu to make the VMTE valid - any would do
> > */
> > + vcpu0 = kvm_get_vcpu(kvm, 0);
> > + ret = vgic_v5_send_command(vcpu0, VMTE_MAKE_VALID);
> > + if (ret)
> > + return ret;
> > +
> > + /* Loop over all VPEs, allocate/populate their data structures */
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + ret = vgic_v5_vmte_alloc_vpe(vcpu);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + return ret;
> 
> I'm very worried about the error handling of that function. Who is
> responsible for cleaning up the mess when this fails?

I'd been working on the flawed assumption that vgic_v5_teardown() will
be called in response to an init failure. I've now reworked this to
explicitly roll back everything on a failure by proactively calling
vgic_v5_teardown(), which has also been made safe to call again in the
teardown path.

Thanks,
Sascha

> 
> >  }
> >  
> >  void vgic_v5_teardown(struct kvm *kvm)
> >  {
> > - vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> > -}
> > + struct kvm_vcpu *vcpu, *vcpu0;
> > + struct vgic_dist *dist = &kvm->arch.vgic;
> > + unsigned long i;
> > + int rc;
> >  
> > -int vgic_v5_map_resources(struct kvm *kvm)
> > -{
> > - if (!vgic_initialized(kvm))
> > - return -EBUSY;
> > + /*
> > + * If the VM's ID isn't valid, then we failed init very early.
> > Nothing
> > + * to do here.
> > + */
> > + if (!kvm->arch.vgic.gicv5_vm.vm_id_valid)
> > + return;
> >  
> > - return 0;
> > + if (kvm->arch.vgic.gicv5_vm.vmte_allocated) {
> > + /* Make the VM invalid  */
> > + vcpu0 = kvm_get_vcpu(kvm, 0);
> > + rc = vgic_v5_send_command(vcpu0, VMTE_MAKE_INVALID);
> > + if (rc)
> > + kvm_err("could not make VMTE invalid\n");
> > +
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + if (vgic_v5_vmte_free_vpe(vcpu))
> > + kvm_err("Failed to free VPE\n");
> > + }
> > +
> > + if (vgic_v5_vmte_release(kvm))
> > + kvm_err("Failed to release VM 0x%x\n", dist->gicv5_vm.vm_id);
> > + }
> > +
> > + vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> > +
> > + vgic_v5_release_vm_id(kvm);
> >  }
> >  
> >  int vgic_v5_finalize_ppi_state(struct kvm *kvm)
> 
> Thanks,
> 
>  M.
>