[PATCH 33/89] KVM: arm64: Handle guest stage-2 page-tables entirely at EL2

Tue Jun 7 18:16:56 PDT 2022

On 6/1/2022 12:45 AM, Will Deacon wrote:
> On Fri, May 20, 2022 at 05:03:29PM +0100, Alexandru Elisei wrote:
>> On Thu, May 19, 2022 at 02:41:08PM +0100, Will Deacon wrote:
>>> Now that EL2 is able to manage guest stage-2 page-tables, avoid
>>> allocating a separate MMU structure in the host and instead introduce a
>>> new fault handler which responds to guest stage-2 faults by sharing
>>> GUP-pinned pages with the guest via a hypercall. These pages are
>>> recovered (and unpinned) on guest teardown via the page reclaim
>>> hypercall.
>>>
>>> Signed-off-by: Will Deacon <will at kernel.org>
>>> ---
>> [..]
>>> +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>> +			  unsigned long hva)
>>> +{
>>> +	struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
>>> +	struct mm_struct *mm = current->mm;
>>> +	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
>>> +	struct kvm_pinned_page *ppage;
>>> +	struct kvm *kvm = vcpu->kvm;
>>> +	struct page *page;
>>> +	u64 pfn;
>>> +	int ret;
>>> +
>>> +	ret = topup_hyp_memcache(hyp_memcache, kvm_mmu_cache_min_pages(kvm));
>>> +	if (ret)
>>> +		return -ENOMEM;
>>> +
>>> +	ppage = kmalloc(sizeof(*ppage), GFP_KERNEL_ACCOUNT);
>>> +	if (!ppage)
>>> +		return -ENOMEM;
>>> +
>>> +	ret = account_locked_vm(mm, 1, true);
>>> +	if (ret)
>>> +		goto free_ppage;
>>> +
>>> +	mmap_read_lock(mm);
>>> +	ret = pin_user_pages(hva, 1, flags, &page, NULL);
>>
>> When I implemented memory pinning via GUP for the KVM SPE series, I
>> discovered that the pages were regularly unmapped at stage 2 because of
>> automatic numa balancing, as change_prot_numa() ends up calling
>> mmu_notifier_invalidate_range_start().
>>
>> I was curious how you managed to avoid that, I don't know my way around
>> pKVM and can't seem to find where that's implemented.
> 
> With this series, we don't take any notice of the MMU notifiers at EL2
> so the stage-2 remains intact. The GUP pin will prevent the page from
> being migrated as the rmap walker won't be able to drop the mapcount.
> 
> It's functional, but we'd definitely like to do better in the long term.
> The fd-based approach that I mentioned in the cover letter gets us some of
> the way there for protected guests ("private memory"), but non-protected
> guests running under pKVM are proving to be pretty challenging (we need to
> deal with things like sharing the zero page...).
> 
> Will

My understanding is that with pin_user_pages(), the pages used by
guests (both protected and non-protected) stay pinned until guest
teardown and will not be swapped out or migrated in the meantime, so
there is no need to handle the MMU notifiers. Is that right?
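
To state that understanding concretely, here is a toy userspace model of
the lifecycle as I read it from the commit message. It is only a sketch,
not kernel code: struct toy_page and the toy_* helpers are hypothetical
names made up for illustration, and the "migration is refused while a pin
is held" rule is a simplification of what the core MM actually checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only what this model needs. */
struct toy_page {
	int pincount;	/* long-term pins held on behalf of the guest */
	bool mapped_s2;	/* still shared with the guest at stage-2 */
};

/* Models the fault path: pin the page, then share it with the guest. */
static void toy_pin_and_map(struct toy_page *p)
{
	p->pincount++;
	p->mapped_s2 = true;
}

/* Models migration or swap-out: assumed to be refused while pinned. */
static bool toy_try_migrate(const struct toy_page *p)
{
	return p->pincount == 0;
}

/* Models guest teardown: reclaim the page from the guest, then unpin. */
static void toy_reclaim_and_unpin(struct toy_page *p)
{
	assert(p->pincount > 0);
	p->mapped_s2 = false;
	p->pincount--;
}
```

In this model the stage-2 mapping never has to be torn down by a
notifier, because toy_try_migrate() cannot succeed between
toy_pin_and_map() and toy_reclaim_and_unpin().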