[PATCH 33/89] KVM: arm64: Handle guest stage-2 page-tables entirely at EL2

Will Deacon will at kernel.org
Tue May 31 09:45:51 PDT 2022


On Fri, May 20, 2022 at 05:03:29PM +0100, Alexandru Elisei wrote:
> On Thu, May 19, 2022 at 02:41:08PM +0100, Will Deacon wrote:
> > Now that EL2 is able to manage guest stage-2 page-tables, avoid
> > allocating a separate MMU structure in the host and instead introduce a
> > new fault handler which responds to guest stage-2 faults by sharing
> > GUP-pinned pages with the guest via a hypercall. These pages are
> > recovered (and unpinned) on guest teardown via the page reclaim
> > hypercall.
> > 
> > Signed-off-by: Will Deacon <will at kernel.org>
> > ---
> [..]
> > +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > +			  unsigned long hva)
> > +{
> > +	struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
> > +	struct mm_struct *mm = current->mm;
> > +	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
> > +	struct kvm_pinned_page *ppage;
> > +	struct kvm *kvm = vcpu->kvm;
> > +	struct page *page;
> > +	u64 pfn;
> > +	int ret;
> > +
> > +	ret = topup_hyp_memcache(hyp_memcache, kvm_mmu_cache_min_pages(kvm));
> > +	if (ret)
> > +		return -ENOMEM;
> > +
> > +	ppage = kmalloc(sizeof(*ppage), GFP_KERNEL_ACCOUNT);
> > +	if (!ppage)
> > +		return -ENOMEM;
> > +
> > +	ret = account_locked_vm(mm, 1, true);
> > +	if (ret)
> > +		goto free_ppage;
> > +
> > +	mmap_read_lock(mm);
> > +	ret = pin_user_pages(hva, 1, flags, &page, NULL);
> 
> When I implemented memory pinning via GUP for the KVM SPE series, I
> discovered that the pages were regularly unmapped at stage 2 because of
> automatic NUMA balancing, as change_prot_numa() ends up calling
> mmu_notifier_invalidate_range_start().
> 
> I was curious how you managed to avoid that; I don't know my way around
> pKVM and can't seem to find where that's implemented.

With this series, we don't take any notice of the MMU notifiers at EL2,
so the stage-2 remains intact. The GUP pin will prevent the page from
being migrated, as the rmap walker won't be able to drop the mapcount.
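
For completeness, the teardown side mentioned in the commit message is
roughly the mirror image of the fault path quoted above: walk the record
of pinned pages, ask EL2 to reclaim each one, then drop the pin and the
locked_vm accounting. Minimal sketch only -- the __pkvm_host_reclaim_page
hypercall ID, the kvm->arch.pkvm.pinned_pages list and the
struct kvm_pinned_page layout are assumptions here, not lifted from the
series:

#include <linux/kvm_host.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <asm/kvm_asm.h>

/* Assumed layout; the real definition isn't visible in the hunk above. */
struct kvm_pinned_page {
	struct list_head	link;
	struct page		*page;
};

static void pkvm_teardown_pinned_pages(struct kvm *kvm)
{
	struct kvm_pinned_page *ppage, *tmp;

	list_for_each_entry_safe(ppage, tmp, &kvm->arch.pkvm.pinned_pages,
				 link) {
		/* Have EL2 unmap the page and hand it back to the host. */
		WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_reclaim_page,
					  page_to_pfn(ppage->page)));

		/* Drop the long-term GUP pin taken in pkvm_mem_abort(). */
		unpin_user_pages_dirty_lock(&ppage->page, 1, true);
		account_locked_vm(kvm->mm, 1, false);

		list_del(&ppage->link);
		kfree(ppage);
	}
}

The ordering matters: the reclaim hypercall has to complete before the
pin is dropped, otherwise the host could reuse a page that EL2 still has
mapped into the guest.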

It's functional, but we'd definitely like to do better in the long term.
The fd-based approach that I mentioned in the cover letter gets us some of
the way there for protected guests ("private memory"), but non-protected
guests running under pKVM are proving to be pretty challenging (we need to
deal with things like sharing the zero page...).

Will


