[PATCH v2 14/35] KVM: arm64: Handle aborts from protected VMs
Will Deacon
will at kernel.org
Wed Mar 4 06:06:49 PST 2026
On Thu, Feb 12, 2026 at 10:37:19AM +0000, Alexandru Elisei wrote:
> On Mon, Jan 19, 2026 at 12:46:07PM +0000, Will Deacon wrote:
> > Introduce a new abort handler for resolving stage-2 page faults from
> > protected VMs by pinning and donating anonymous memory. This is
> > considerably simpler than the infamous user_mem_abort() as we only have
> > to deal with translation faults at the pte level.
> >
> > Signed-off-by: Will Deacon <will at kernel.org>
> > ---
> > arch/arm64/kvm/mmu.c | 89 ++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 81 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index a23a4b7f108c..b21a5bf3d104 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1641,6 +1641,74 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > return ret != -EAGAIN ? ret : 0;
> > }
> >
> > +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > + struct kvm_memory_slot *memslot, unsigned long hva)
> > +{
> > + unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
> > + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > + struct mm_struct *mm = current->mm;
> > + struct kvm *kvm = vcpu->kvm;
> > + void *hyp_memcache;
> > + struct page *page;
> > + int ret;
> > +
> > + ret = prepare_mmu_memcache(vcpu, true, &hyp_memcache);
> > + if (ret)
> > + return -ENOMEM;
> > +
> > + ret = account_locked_vm(mm, 1, true);
> > + if (ret)
> > + return ret;
> > +
> > + mmap_read_lock(mm);
> > + ret = pin_user_pages(hva, 1, flags, &page);
> > + mmap_read_unlock(mm);
>
> If the page is part of a large folio, the entire folio gets pinned here, not
> just the page returned by pin_user_pages(). Do you reckon that should be
> considered when calling account_locked_vm()?
I don't _think_ so.
Since we only ask for a single page when we call pin_user_pages(), the
folio refcount is adjusted by 1, even for large folios. Trying to
adjust the accounting based on whether the pinned page forms part of a
large folio feels error-prone: the migration triggered by the longterm
pin could end up splitting the folio, and we would also have to avoid
double-accounting subsequent faults to the same folio. It also feels
fragile if the mm code becomes able to split partially pinned folios
in future (as it appears to be able to do for partially mapped
folios).
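For illustration, the inverse path would mirror this one-page-at-a-time
accounting. A minimal sketch (not from this patch, and the real reclaim
path is more involved): each fault pins and accounts exactly one page,
so teardown drops exactly one pin and one locked page, regardless of
how many pinned pages happen to share a large folio:

	/* Sketch only: undo one fault's worth of pinning/accounting. */
	unpin_user_page(page);
	account_locked_vm(mm, 1, false);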
> > + if (ret == -EHWPOISON) {
> > + kvm_send_hwpoison_signal(hva, PAGE_SHIFT);
> > + ret = 0;
> > + goto dec_account;
> > + } else if (ret != 1) {
> > + ret = -EFAULT;
> > + goto dec_account;
> > + } else if (!folio_test_swapbacked(page_folio(page))) {
> > + /*
> > + * We really can't deal with page-cache pages returned by GUP
> > + * because (a) we may trigger writeback of a page for which we
> > + * no longer have access and (b) page_mkclean() won't find the
> > + * stage-2 mapping in the rmap so we can get out-of-whack with
> > + * the filesystem when marking the page dirty during unpinning
> > + * (see cc5095747edf ("ext4: don't BUG if someone dirty pages
> > + * without asking ext4 first")).
>
> I've been trying to wrap my head around this. Would you mind providing a few
> more hints about what the issue is? I'm sure the approach is correct, it's
> likely just me not being familiar with the code.
The fundamental problem is that unmapping page-cache pages from the host
stage-2 can confuse filesystems, which don't know either that the page
is now inaccessible (and so they may attempt to access it) or that the
page can be accessed concurrently by the guest without the page state
being updated.
To fix those issues, we would need to support MMU notifiers for protected
memory, but that would allow the host to mess with the guest stage-2
page-table, which breaks the security model that we're trying to uphold.
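As an aside, that dirty-marking pitfall is also why long-term pins are
expected to be dropped with the _dirty_lock variant, which marks the
page dirty via set_page_dirty_lock() rather than a bare
set_page_dirty(). A sketch, assuming a page pinned as above:

	/*
	 * Drop a long-term pin and mark the page dirty under the page
	 * lock, avoiding the class of bug fixed by cc5095747edf.
	 */
	unpin_user_pages_dirty_lock(&page, 1, true);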
> > @@ -2190,15 +2258,20 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > goto out_unlock;
> > }
> >
> > - VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) &&
> > - !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu));
> > + if (kvm_vm_is_protected(vcpu->kvm)) {
> > + ret = pkvm_mem_abort(vcpu, fault_ipa, memslot, hva);
>
> I guess the reason this comes after handling an access fault is because you want
> the WARN_ON() to trigger in pkvm_pgtable_stage2_mkyoung().
Right, we should only ever see translation faults for protected guests
and that's all that pkvm_mem_abort() is prepared to handle, so we call
it last.
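If we wanted to make that assumption explicit, a defensive check along
these lines could sit at the top of pkvm_mem_abort() (just a sketch,
not part of the patch):

	/*
	 * pkvm_mem_abort() only handles stage-2 translation faults;
	 * access and permission faults are dealt with before we get
	 * here.
	 */
	if (kvm_vcpu_trap_get_fault_type(vcpu) != ESR_ELx_FSC_FAULT)
		return -EFAULT;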
Will