[PATCH] KVM: arm64: Handle permission faults with guest_memfd
Alexandru Elisei
alexandru.elisei at arm.com
Fri May 1 07:18:16 PDT 2026
Hi Fuad,
On Thu, Apr 30, 2026 at 03:48:21PM +0100, Fuad Tabba wrote:
> Hi Alexandru,
>
> On Thu, 30 Apr 2026 at 14:24, Alexandru Elisei <alexandru.elisei at arm.com> wrote:
> >
> > gmem_abort() calls kvm_pgtable_stage2_map() to make changes to stage 2. It
> > does this both for relaxing permissions on an existing mapping and for
> > installing a missing mapping.
> >
> > kvm_pgtable_stage2_map() doesn't make changes to stage 2 if there is an
> > existing, valid entry and the new entry modifies only the permissions.
> > This is checked in:
> >
> > kvm_pgtable_stage2_map()
> > stage2_map_walk_leaf()
> > stage2_map_walker_try_leaf()
> > stage2_pte_needs_update()
> >
> > and if only the permissions differ, kvm_pgtable_stage2_map() returns
> > -EAGAIN and KVM returns to the guest to replay the instruction. The
> > assumption is that a concurrent fault on a different VCPU already mapped
> > the faulting IPA, and replaying the instruction will either succeed, or
> > cause a permission fault, which should be handled with
> > kvm_pgtable_stage2_relax_perms().
> >
> > gmem_abort(), on a read or write fault on a system without DIC (instruction
> > cache invalidation required for data to instruction coherence), installs a
> > valid entry with read and write permissions, but without executable
> > permissions. On an execution fault on the same page, gmem_abort() attempts
> > to relax the permissions to allow execution, but calls
> > kvm_pgtable_stage2_map() to change the existing, valid entry.
> > kvm_pgtable_stage2_map() returns -EAGAIN and KVM resumes execution from the
> > faulting instruction, which leads to an infinite loop of permission faults
> > on the same instruction.
> >
> > Allow the guest to make progress by using kvm_pgtable_stage2_relax_perms()
> > to relax permissions.
> >
> > Fixes: a7b57e099592 ("KVM: arm64: Handle guest_memfd-backed guest page faults")
> > Signed-off-by: Alexandru Elisei <alexandru.elisei at arm.com>
> > ---
> >
> > Lightly tested on an Orion O6 board, without pkvm, and without nested
> > virtualisation.
> >
> > Doesn't apply cleanly on top of a7b57e099592; I can send a patch for that if
> > needed.
> >
> > arch/arm64/kvm/mmu.c | 24 +++++++++++++++++-------
> > 1 file changed, 17 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index d089c107d9b7..dff58de7703b 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1576,6 +1576,7 @@ struct kvm_s2_fault_desc {
> > static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > {
> > bool write_fault, exec_fault;
> > + bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu);
> > enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
> > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> > @@ -1587,10 +1588,12 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > gfn_t gfn;
> > int ret;
> >
> > - memcache = get_mmu_memcache(s2fd->vcpu);
> > - ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> > - if (ret)
> > - return ret;
> > + if (!perm_fault) {
> > + memcache = get_mmu_memcache(s2fd->vcpu);
> > + ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> > + if (ret)
> > + return ret;
> > + }
>
> memcache is now uninitialized when perm_fault, and only read in the
> !perm_fault branch below. It's safe today, but initializing it to NULL
> would silence any defensive analyzer and keep things robust against
> future churn.
Sure, I'll initialise memcache to NULL.
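Something along these lines (sketch of the amended hunk; the declaration type
is abbreviated here, the actual type stays as in the existing function):

```c
	void *memcache = NULL;	/* sketch; keep the existing declaration type */

	if (!perm_fault) {
		memcache = get_mmu_memcache(s2fd->vcpu);
		ret = topup_mmu_memcache(s2fd->vcpu, memcache);
		if (ret)
			return ret;
	}
```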
>
> >
> > if (s2fd->nested)
> > gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
> > @@ -1631,9 +1634,16 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > goto out_unlock;
> > }
> >
> > - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> > - __pfn_to_phys(pfn), prot,
> > - memcache, flags);
> > + if (perm_fault) {
> > + /* Preserve the software bits from the existing table entry. */
> > + prot &= ~KVM_NV_GUEST_MAP_SZ;
>
> Can you please phrase it the same way as the existing comment in
> user_mem_abort, as that phrasing more precisely describes the
> rationale? i.e.,
Of course.
>
> /*
> * Drop the SW bits in favour of those stored in the
> * PTE, which will be preserved.
> */
>
> > + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms(pgt, s2fd->fault_ipa,
> > + prot, flags));
>
> nit: args wrapped inside KVM_PGT_FN() is inconsistent with the rest of
> the file (and with the else-branch immediately below). Every other
> call site uses name-only inside the macro. Both expand the same way,
> but it's better to stick to the convention.
This was a typo on my part, thanks for catching it.
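Will fix it so only the function name is wrapped in the macro, matching the
other call sites:

```c
		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, s2fd->fault_ipa,
								 prot, flags);
```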
>
> > + } else {
> > + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> > + __pfn_to_phys(pfn), prot,
> > + memcache, flags);
> > + }
> >
> > out_unlock:
> > kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
>
> With these fixed:
>
> Reviewed-by: Fuad Tabba <tabba at google.com>
Thanks!
Alex