[PATCH v2 14/35] KVM: arm64: Handle aborts from protected VMs
Alexandru Elisei
alexandru.elisei at arm.com
Fri Mar 6 03:34:37 PST 2026
Hi Will,
On Wed, Mar 04, 2026 at 02:06:49PM +0000, Will Deacon wrote:
> On Thu, Feb 12, 2026 at 10:37:19AM +0000, Alexandru Elisei wrote:
> > On Mon, Jan 19, 2026 at 12:46:07PM +0000, Will Deacon wrote:
> > > Introduce a new abort handler for resolving stage-2 page faults from
> > > protected VMs by pinning and donating anonymous memory. This is
> > > considerably simpler than the infamous user_mem_abort() as we only have
> > > to deal with translation faults at the pte level.
> > >
> > > Signed-off-by: Will Deacon <will at kernel.org>
> > > ---
> > > arch/arm64/kvm/mmu.c | 89 ++++++++++++++++++++++++++++++++++++++++----
> > > 1 file changed, 81 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index a23a4b7f108c..b21a5bf3d104 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1641,6 +1641,74 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > return ret != -EAGAIN ? ret : 0;
> > > }
> > >
> > > +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > + struct kvm_memory_slot *memslot, unsigned long hva)
> > > +{
> > > + unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
> > > + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > + struct mm_struct *mm = current->mm;
> > > + struct kvm *kvm = vcpu->kvm;
> > > + void *hyp_memcache;
> > > + struct page *page;
> > > + int ret;
> > > +
> > > + ret = prepare_mmu_memcache(vcpu, true, &hyp_memcache);
> > > + if (ret)
> > > + return -ENOMEM;
> > > +
> > > + ret = account_locked_vm(mm, 1, true);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + mmap_read_lock(mm);
> > > + ret = pin_user_pages(hva, 1, flags, &page);
> > > + mmap_read_unlock(mm);
> >
> > If the page is part of a large folio, the entire folio gets pinned here, not
> > just the page returned by pin_user_pages(). Do you reckon that should be
> > considered when calling account_locked_vm()?
>
> I don't _think_ so.
>
> Since we only ask for a single page when we call pin_user_pages(), the
> folio refcount will be adjusted by 1, even for large folios. Trying to
For large folios, _pincount is adjusted by 1 with FOLL_LONGTERM. For
non-large folios, the refcount is increased by GUP_PIN_COUNTING_BIAS == 1024
(try_grab_folio() is where the magic happens).
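To make the difference concrete, here is a rough userspace model of that
accounting. It is only a sketch of my reading of try_grab_folio() for a
single-page pin, not the kernel code: struct model_folio,
model_pin_one_page(), model_maybe_pinned() and MODEL_PIN_COUNTING_BIAS are
made-up names, and the real counters are atomics that live in struct folio.

```c
#include <stdbool.h>

/* Hypothetical stand-in for GUP_PIN_COUNTING_BIAS. */
#define MODEL_PIN_COUNTING_BIAS 1024

/* Hypothetical, simplified view of the counters touched by a pin. */
struct model_folio {
	bool large;	/* true if the folio spans more than one page */
	int refcount;	/* models the folio reference count */
	int pincount;	/* models _pincount, only used for large folios */
};

/* Model of pinning one page, which is all pkvm_mem_abort() asks for. */
static void model_pin_one_page(struct model_folio *folio)
{
	if (folio->large) {
		/* Large folio: one reference plus one dedicated pin count. */
		folio->refcount += 1;
		folio->pincount += 1;
	} else {
		/* Small folio: the pin is encoded in the refcount itself. */
		folio->refcount += MODEL_PIN_COUNTING_BIAS;
	}
}

/* Model of the "maybe pinned" heuristic that follows from the above. */
static bool model_maybe_pinned(const struct model_folio *folio)
{
	if (folio->large)
		return folio->pincount > 0;
	return folio->refcount >= MODEL_PIN_COUNTING_BIAS;
}
```

In this model, pinning one page of a large folio moves both counters by 1
regardless of how many pages the folio contains, while a small folio that
starts at refcount 1 ends up at 1025.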
> adjust the accounting based on whether the pinned page forms part of a
> large folio feels error-prone, not least because the migration triggered
> by the longterm pin could actually end up splitting the folio but also
Hmm... as far as I can tell, pin_user_pages() uses MIGRATE_SYNC to migrate folios
not suitable for longterm pinning, and after migration has completed it attempts
to pin the userspace address again.
Also, split_folio() and friends cannot split a folio for which
folio_maybe_dma_pinned() returns true, according to the comments for the
various functions.
> because we'd have to avoid double accounting on subsequent faults to the
> same folio. It also feels fragile if the mm code is able to split
> partially pinned folios in future (like it appears to be able to for
> partially mapped folios).
I'm not sure why mm would want to split a folio for which
folio_maybe_dma_pinned() returns true. But I'm far from being an mm expert, so
I do understand why relying on this might feel fragile.
>
> > > + if (ret == -EHWPOISON) {
> > > + kvm_send_hwpoison_signal(hva, PAGE_SHIFT);
> > > + ret = 0;
> > > + goto dec_account;
> > > + } else if (ret != 1) {
> > > + ret = -EFAULT;
> > > + goto dec_account;
> > > + } else if (!folio_test_swapbacked(page_folio(page))) {
> > > + /*
> > > + * We really can't deal with page-cache pages returned by GUP
> > > + * because (a) we may trigger writeback of a page for which we
> > > + * no longer have access and (b) page_mkclean() won't find the
> > > + * stage-2 mapping in the rmap so we can get out-of-whack with
> > > + * the filesystem when marking the page dirty during unpinning
> > > + * (see cc5095747edf ("ext4: don't BUG if someone dirty pages
> > > + * without asking ext4 first")).
> >
> > I've been trying to wrap my head around this. Would you mind providing a few
> > more hints about what the issue is? I'm sure the approach is correct, it's
> > likely just me not being familiar with the code.
>
> The fundamental problem is that unmapping page-cache pages from the host
> stage-2 can confuse filesystems who don't know that either the page is
> now inaccessible (and so may attempt to access it) or that the page can
> be accessed concurrently by the guest without updating the page state.
>
> To fix those issues, we would need to support MMU notifiers for protected
> memory but that would allow the host to mess with the guest stage-2
> page-table, which breaks the security model that we're trying to uphold.
Aha, got it, thanks for the explanation!
Alex
>
> > > @@ -2190,15 +2258,20 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > goto out_unlock;
> > > }
> > >
> > > - VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) &&
> > > - !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu));
> > > + if (kvm_vm_is_protected(vcpu->kvm)) {
> > > + ret = pkvm_mem_abort(vcpu, fault_ipa, memslot, hva);
> >
> > I guess the reason this comes after handling an access fault is because you want
> > the WARN_ON() to trigger in pkvm_pgtable_stage2_mkyoung().
>
> Right, we should only ever see translation faults for protected guests
> and that's all that pkvm_mem_abort() is prepared to handle, so we call
> it last.
>
> Will