[PATCH] KVM: arm64: Handle permission faults with guest_memfd
Fuad Tabba
tabba at google.com
Thu Apr 30 07:48:21 PDT 2026
Hi Alexandru,
On Thu, 30 Apr 2026 at 14:24, Alexandru Elisei <alexandru.elisei at arm.com> wrote:
>
> gmem_abort() calls kvm_pgtable_stage2_map() to make changes to stage 2. It
> does this for both relaxing permissions on an existing mapping and to
> install a missing mapping.
>
> kvm_pgtable_stage2_map() doesn't make changes to stage 2 if there is an
> existing, valid entry and the new entry modifies only the permissions.
> This is checked in:
>
> kvm_pgtable_stage2_map()
> stage2_map_walk_leaf()
> stage2_map_walker_try_leaf()
> stage2_pte_needs_update()
>
> and if only the permissions differ, kvm_pgtable_stage2_map() returns
> -EAGAIN and KVM returns to the guest to replay the instruction. The
> assumption is that a concurrent fault on a different VCPU already mapped
> the faulting IPA, and replaying the instruction will either succeed, or
> cause a permission fault, which should be handled with
> kvm_pgtable_stage2_relax_perms().
>
> gmem_abort(), on a read or write fault on a system without DIC (instruction
> cache invalidation required for data to instruction coherence), installs a
> valid entry with read and write permissions, but without executable
> permissions. On an execution fault on the same page, gmem_abort() attempts
> to relax the permissions to allow execution, but calls
> kvm_pgtable_stage2_map() to change the existing, valid, entry.
> kvm_pgtable_stage2_map() returns -EAGAIN and KVM resumes execution from the
> faulting instruction, which leads to an infinite loop of permission faults
> on the same instruction.
>
> Allow the guest to make progress by using kvm_pgtable_stage2_relax_perms()
> to relax permissions.
>
> Fixes: a7b57e099592 ("KVM: arm64: Handle guest_memfd-backed guest page faults")
> Signed-off-by: Alexandru Elisei <alexandru.elisei at arm.com>
> ---
>
> Lightly tested on an Orion O6 board, without pkvm, and without nested
> virtualisation.
>
> Doesn't apply cleanly on top of a7b57e099592, I can send a patch for that if
> needed.
>
> arch/arm64/kvm/mmu.c | 24 +++++++++++++++++-------
> 1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d089c107d9b7..dff58de7703b 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1576,6 +1576,7 @@ struct kvm_s2_fault_desc {
> static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> {
> bool write_fault, exec_fault;
> + bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu);
> enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> @@ -1587,10 +1588,12 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> gfn_t gfn;
> int ret;
>
> - memcache = get_mmu_memcache(s2fd->vcpu);
> - ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> - if (ret)
> - return ret;
> + if (!perm_fault) {
> + memcache = get_mmu_memcache(s2fd->vcpu);
> + ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> + if (ret)
> + return ret;
> + }
memcache is now uninitialized when perm_fault, and only read in the
!perm_fault branch below. It's safe today, but initializing it to NULL
would silence any defensive analyzer and keep things robust against
future churn.
>
> if (s2fd->nested)
> gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
> @@ -1631,9 +1634,16 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> goto out_unlock;
> }
>
> - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> - __pfn_to_phys(pfn), prot,
> - memcache, flags);
> + if (perm_fault) {
> + /* Preserve the software bits from the existing table entry. */
> + prot &= ~KVM_NV_GUEST_MAP_SZ;
Can you please phrase it the same way as the existing comment in
user_mem_abort, as that phrasing more precisely describes the
rationale? i.e.,
/*
* Drop the SW bits in favour of those stored in the
* PTE, which will be preserved.
*/
> + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms(pgt, s2fd->fault_ipa,
> + prot, flags));
nit: args wrapped inside KVM_PGT_FN() is inconsistent with the rest of
the file (and with the else-branch immediately below). Every other
call site uses name-only inside the macro. Both expand the same way,
but it's better to stick to the convention.
> + } else {
> + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> + __pfn_to_phys(pfn), prot,
> + memcache, flags);
> + }
>
> out_unlock:
> kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
With these fixed:
Reviewed-by: Fuad Tabba <tabba at google.com>
Cheers,
/fuad
>
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> --
> 2.54.0
>
More information about the linux-arm-kernel
mailing list