[PATCH v2 14/35] KVM: arm64: Handle aborts from protected VMs

Fri Mar 6 03:34:37 PST 2026

Hi Will,

On Wed, Mar 04, 2026 at 02:06:49PM +0000, Will Deacon wrote:
> On Thu, Feb 12, 2026 at 10:37:19AM +0000, Alexandru Elisei wrote:
> > On Mon, Jan 19, 2026 at 12:46:07PM +0000, Will Deacon wrote:
> > > Introduce a new abort handler for resolving stage-2 page faults from
> > > protected VMs by pinning and donating anonymous memory. This is
> > > considerably simpler than the infamous user_mem_abort() as we only have
> > > to deal with translation faults at the pte level.
> > > 
> > > Signed-off-by: Will Deacon <will at kernel.org>
> > > ---
> > >  arch/arm64/kvm/mmu.c | 89 ++++++++++++++++++++++++++++++++++++++++----
> > >  1 file changed, 81 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index a23a4b7f108c..b21a5bf3d104 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1641,6 +1641,74 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  	return ret != -EAGAIN ? ret : 0;
> > >  }
> > >  
> > > +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > +		struct kvm_memory_slot *memslot, unsigned long hva)
> > > +{
> > > +	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
> > > +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > +	struct mm_struct *mm = current->mm;
> > > +	struct kvm *kvm = vcpu->kvm;
> > > +	void *hyp_memcache;
> > > +	struct page *page;
> > > +	int ret;
> > > +
> > > +	ret = prepare_mmu_memcache(vcpu, true, &hyp_memcache);
> > > +	if (ret)
> > > +		return -ENOMEM;
> > > +
> > > +	ret = account_locked_vm(mm, 1, true);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	mmap_read_lock(mm);
> > > +	ret = pin_user_pages(hva, 1, flags, &page);
> > > +	mmap_read_unlock(mm);
> > 
> > If the page is part of a large folio, the entire folio gets pinned here, not
> > just the page returned by pin_user_pages(). Do you reckon that should be
> > considered when calling account_locked_vm()?
> 
> I don't _think_ so.
> 
> Since we only ask for a single page when we call pin_user_pages(), the
> folio refcount will be adjusted by 1, even for large folios. Trying to

For large folios, _pincount is adjusted by 1 with FOLL_LONGTERM. For
non-large folios, the refcount is increased by GUP_PIN_COUNTING_BIAS == 1024
(try_grab_folio() is where the magic happens).
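
To make the two cases concrete, here is a rough userspace model of the pin
accounting as I read it. It is only a sketch: struct toy_folio and the toy_*()
helpers are made-up names, not kernel APIs, and the real try_grab_folio() and
folio_maybe_dma_pinned() do a fair bit more than this.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct folio: only the fields used for pinning. */
struct toy_folio {
	bool large;		/* models folio_test_large() */
	unsigned int refcount;	/* models folio_ref_count() */
	unsigned int pincount;	/* models _pincount (large folios only) */
};

/* Simplified model of the pin path of try_grab_folio() for 'refs' pages. */
static void toy_pin(struct toy_folio *f, unsigned int refs)
{
	if (f->large) {
		/* Large folio: one reference plus one pincount per pin. */
		f->refcount += refs;
		f->pincount += refs;
	} else {
		/* Small folio: the pin is encoded in the refcount itself. */
		f->refcount += refs * GUP_PIN_COUNTING_BIAS;
	}
}

/* Inverse of toy_pin(); asserts that the pins being dropped exist. */
static void toy_unpin(struct toy_folio *f, unsigned int refs)
{
	if (f->large) {
		assert(f->pincount >= refs && f->refcount >= refs);
		f->refcount -= refs;
		f->pincount -= refs;
	} else {
		assert(f->refcount >= refs * GUP_PIN_COUNTING_BIAS);
		f->refcount -= refs * GUP_PIN_COUNTING_BIAS;
	}
}

/* Simplified model of folio_maybe_dma_pinned(). */
static bool toy_maybe_dma_pinned(const struct toy_folio *f)
{
	if (f->large)
		return f->pincount > 0;
	return f->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model, pinning one page of a large folio that starts with a refcount of
1 leaves it at refcount 2 and pincount 1, while doing the same to a non-large
folio leaves the refcount at 1025. Either way the whole folio reports as maybe
pinned, regardless of how many of its pages were asked for.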

> adjust the accounting based on whether the pinned page forms part of a
> large folio feels error-prone, not least because the migration triggered
> by the longterm pin could actually end up splitting the folio but also

Hmm... as far as I can tell, pin_user_pages() uses MIGRATE_SYNC to migrate
folios not suitable for longterm pinning, and after migration has completed it
attempts to pin the userspace address again.

Also, split_folio() and friends cannot split a folio for which
folio_maybe_dma_pinned() is true, according to the comments for the various
functions.

> because we'd have to avoid double accounting on subsequent faults to the
> same folio. It also feels fragile if the mm code is able to split
> partially pinned folios in future (like it appears to be able to for
> partially mapped folios).

I'm not sure why mm would want to split a folio for which
folio_maybe_dma_pinned() is true. But I'm far from being an mm expert, so I do
understand why relying on this might feel fragile.

> 
> > > +	if (ret == -EHWPOISON) {
> > > +		kvm_send_hwpoison_signal(hva, PAGE_SHIFT);
> > > +		ret = 0;
> > > +		goto dec_account;
> > > +	} else if (ret != 1) {
> > > +		ret = -EFAULT;
> > > +		goto dec_account;
> > > +	} else if (!folio_test_swapbacked(page_folio(page))) {
> > > +		/*
> > > +		 * We really can't deal with page-cache pages returned by GUP
> > > +		 * because (a) we may trigger writeback of a page for which we
> > > +		 * no longer have access and (b) page_mkclean() won't find the
> > > +		 * stage-2 mapping in the rmap so we can get out-of-whack with
> > > +		 * the filesystem when marking the page dirty during unpinning
> > > +		 * (see cc5095747edf ("ext4: don't BUG if someone dirty pages
> > > +		 * without asking ext4 first")).
> > 
> > I've been trying to wrap my head around this. Would you mind providing a few
> > more hints about what the issue is? I'm sure the approach is correct, it's
> > likely just me not being familiar with the code.
> 
> The fundamental problem is that unmapping page-cache pages from the host
> stage-2 can confuse filesystems who don't know that either the page is
> now inaccessible (and so may attempt to access it) or that the page can
> be accessed concurrently by the guest without updating the page state.
> 
> To fix those issues, we would need to support MMU notifiers for protected
> memory but that would allow the host to mess with the guest stage-2
> page-table, which breaks the security model that we're trying to uphold.

Aha, got it, thanks for the explanation!

Alex

> 
> > > @@ -2190,15 +2258,20 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > >  		goto out_unlock;
> > >  	}
> > >  
> > > -	VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) &&
> > > -			!write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu));
> > > +	if (kvm_vm_is_protected(vcpu->kvm)) {
> > > +		ret = pkvm_mem_abort(vcpu, fault_ipa, memslot, hva);
> > 
> > I guess the reason this comes after handling an access fault is because you want
> > the WARN_ON() to trigger in pkvm_pgtable_stage2_mkyoung().
> 
> Right, we should only ever see translation faults for protected guests
> and that's all that pkvm_mem_abort() is prepared to handle, so we call
> it last.
> 
> Will