[PATCH 33/89] KVM: arm64: Handle guest stage-2 page-tables entirely at EL2

Wed Jul 27 23:50:12 PDT 2022

On 7/27/2022 5:59 PM, Alexandru Elisei wrote:
> Hi,
> 
> On Wed, Jun 08, 2022 at 09:16:56AM +0800, Huang, Shaoqin wrote:
>>
>> On 6/1/2022 12:45 AM, Will Deacon wrote:
>>> On Fri, May 20, 2022 at 05:03:29PM +0100, Alexandru Elisei wrote:
>>>> On Thu, May 19, 2022 at 02:41:08PM +0100, Will Deacon wrote:
>>>>> Now that EL2 is able to manage guest stage-2 page-tables, avoid
>>>>> allocating a separate MMU structure in the host and instead introduce a
>>>>> new fault handler which responds to guest stage-2 faults by sharing
>>>>> GUP-pinned pages with the guest via a hypercall. These pages are
>>>>> recovered (and unpinned) on guest teardown via the page reclaim
>>>>> hypercall.
>>>>>
>>>>> Signed-off-by: Will Deacon <will at kernel.org>
>>>>> ---
>>>> [..]
>>>>> +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>> +			  unsigned long hva)
>>>>> +{
>>>>> +	struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
>>>>> +	struct mm_struct *mm = current->mm;
>>>>> +	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
>>>>> +	struct kvm_pinned_page *ppage;
>>>>> +	struct kvm *kvm = vcpu->kvm;
>>>>> +	struct page *page;
>>>>> +	u64 pfn;
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = topup_hyp_memcache(hyp_memcache, kvm_mmu_cache_min_pages(kvm));
>>>>> +	if (ret)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	ppage = kmalloc(sizeof(*ppage), GFP_KERNEL_ACCOUNT);
>>>>> +	if (!ppage)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	ret = account_locked_vm(mm, 1, true);
>>>>> +	if (ret)
>>>>> +		goto free_ppage;
>>>>> +
>>>>> +	mmap_read_lock(mm);
>>>>> +	ret = pin_user_pages(hva, 1, flags, &page, NULL);
>>>>
>>>> When I implemented memory pinning via GUP for the KVM SPE series, I
>>>> discovered that the pages were regularly unmapped at stage 2 because of
>>>> automatic numa balancing, as change_prot_numa() ends up calling
>>>> mmu_notifier_invalidate_range_start().
>>>>
>>>> I was curious how you managed to avoid that, I don't know my way around
>>>> pKVM and can't seem to find where that's implemented.
>>>
>>> With this series, we don't take any notice of the MMU notifiers at EL2
>>> so the stage-2 remains intact. The GUP pin will prevent the page from
>>> being migrated as the rmap walker won't be able to drop the mapcount.
>>>
>>> It's functional, but we'd definitely like to do better in the long term.
>>> The fd-based approach that I mentioned in the cover letter gets us some of
>>> the way there for protected guests ("private memory"), but non-protected
>>> guests running under pKVM are proving to be pretty challenging (we need to
>>> deal with things like sharing the zero page...).
>>>
>>> Will
>>
>> My understanding is that with the pin_user_pages, the page that used by
>> guests (both protected and non-protected) will stay for a long time, and the
>> page will not be swapped or migrated. So no need to care about the MMU
>> notifiers. Is it right?
> 
> There are two things here.
> 
> First, pinning a page means making the data persistent in memory. From
> Documentation/core-api/pin_user_pages.rst:
> 
> "FOLL_PIN is a *replacement* for FOLL_GET, and is for short term pins on
> pages whose data *will* get accessed. As such, FOLL_PIN is a "more severe"
> form of pinning. And finally, FOLL_LONGTERM is an even more restrictive
> case that has FOLL_PIN as a prerequisite: this is for pages that will be
> pinned longterm, and whose data will be accessed."
> 
> It does not mean that the translation table entry for the page is not
> modified for as long as the pin exists. In the example I gave, automatic
> NUMA balancing changes the protection of translation table entries to
> PAGE_NONE, which will invoke the MMU notifers to unmap the corresponding
> stage 2 entries, regardless of the fact that the pinned pages will not get
> migrated the next time they are accessed.
> 
> There are other mechanisms in the kernel that do that, for example
> split_huge_pmd(), which must always succeed, even if the THP is pinned (it
> transfers the refcounts among the pages): "Note that split_huge_pmd()
> doesn't have any limitations on refcounting: pmd can be split at any point
> and never fails" (Documentation/vm/transhuge.rst, also see
> __split_huge_pmd() from mm/huge_memory.c).
> 
> KSM also does that: it invokes the invalidate_range_start MMU notifier
> before backing out of the merge because of the refcount (see mm/ksm.c::
> try_to_merge_one_page -> write_protect_page).
> 
> This brings me to my second point: one might rightfully ask themselves (I
> did!), why not invoke the MMU notifiers *after* checking that the page is
> not pinned? It turns out that that is not reliable, because the refcount is
> increased by GUP with the page lock held (which is a spinlock), but by
> their design the invalidate_range_start MMU notifiers must be called from
> interruptible + preemptible context. The only way to avoid races would be
> to call the MMU notifier while holding the page table lock, which is
> impossible.
> 
> Hope my explanation has been adequate.
> 
> Thanks,
> Alex

Thanks for your clear explanation.