[PATCH] KVM: arm64: Handle permission faults with guest_memfd

Thu Apr 30 07:48:21 PDT 2026

Hi Alexandru,

On Thu, 30 Apr 2026 at 14:24, Alexandru Elisei <alexandru.elisei at arm.com> wrote:
>
> gmem_abort() calls kvm_pgtable_stage2_map() to make changes to stage 2. It
> does this for both relaxing permissions on an existing mapping and to
> install a missing mapping.
>
> kvm_pgtable_stage2_map() doesn't make changes to stage 2 if there is an
> existing, valid entry and the new entry modifies only the permissions.
> This is checked in:
>
> kvm_pgtable_stage2_map()
>   stage2_map_walk_leaf()
>      stage2_map_walker_try_leaf()
>        stage2_pte_needs_update()
>
> and if only the permissions differ, kvm_pgtable_stage2_map() returns
> -EAGAIN and KVM returns to the guest to replay the instruction. The
> assumption is that a concurrent fault on a different VCPU already mapped
> the faulting IPA, and replaying the instruction will either succeed, or
> cause a permission fault, which should be handled with
> kvm_pgtable_stage2_relax_perms().
>
> gmem_abort(), on a read or write fault on a system without DIC (instruction
> cache invalidation required for data to instruction coherence), installs a
> valid entry with read and write permissions, but without executable
> permissions. On an execution fault on the same page, gmem_abort() attempts
> to relax the permissions to allow execution, but calls
> kvm_pgtable_stage2_map() to change the existing, valid, entry.
> kvm_pgtable_stage2_map() returns -EAGAIN and KVM resumes execution from the
> faulting instruction, which leads to an infinite loop of permission faults
> on the same instruction.
>
> Allow the guest to make progress by using kvm_pgtable_stage2_relax_perms()
> to relax permissions.
>
> Fixes: a7b57e099592 ("KVM: arm64: Handle guest_memfd-backed guest page faults")
> Signed-off-by: Alexandru Elisei <alexandru.elisei at arm.com>
> ---
>
> Lightly tested on an Orion O6 board, without pkvm, and without nested
> virtualisation.
>
> Doesn't apply cleanly on top of a7b57e099592, I can send a patch for that if
> needed.
>
>  arch/arm64/kvm/mmu.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d089c107d9b7..dff58de7703b 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1576,6 +1576,7 @@ struct kvm_s2_fault_desc {
>  static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>  {
>         bool write_fault, exec_fault;
> +       bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu);
>         enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>         struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> @@ -1587,10 +1588,12 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>         gfn_t gfn;
>         int ret;
>
> -       memcache = get_mmu_memcache(s2fd->vcpu);
> -       ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> -       if (ret)
> -               return ret;
> +       if (!perm_fault) {
> +               memcache = get_mmu_memcache(s2fd->vcpu);
> +               ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> +               if (ret)
> +                       return ret;
> +       }

memcache is now uninitialized when perm_fault, and only read in the
!perm_fault branch below. It's safe today, but initializing it to NULL
would silence any defensive analyzer and keep things robust against
future churn.

>
>         if (s2fd->nested)
>                 gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
> @@ -1631,9 +1634,16 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>                 goto out_unlock;
>         }
>
> -       ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> -                                                __pfn_to_phys(pfn), prot,
> -                                                memcache, flags);
> +       if (perm_fault) {
> +               /* Preserve the software bits from the existing table entry. */
> +               prot &= ~KVM_NV_GUEST_MAP_SZ;

Can you please phrase it the same way as the existing comment in
user_mem_abort, as that phrasing more precisely describes the
rationale? i.e.,

   /*
    * Drop the SW bits in favour of those stored in the
    * PTE, which will be preserved.
    */

> +               ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms(pgt, s2fd->fault_ipa,
> +                                                               prot, flags));

nit: args wrapped inside KVM_PGT_FN() is inconsistent with the rest of
the file (and with the else-branch immediately below). Every other
call site uses name-only inside the macro. Both expand the same way,
but it's better to stick to the convention.

> +       } else {
> +               ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> +                                                        __pfn_to_phys(pfn), prot,
> +                                                        memcache, flags);
> +       }
>
>  out_unlock:
>         kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);

With these fixed:

Reviewed-by: Fuad Tabba <tabba at google.com>

Cheers,
/fuad

>
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731

> --
> 2.54.0
>