[PATCH 33/89] KVM: arm64: Handle guest stage-2 page-tables entirely at EL2
Huang, Shaoqin
shaoqin.huang at intel.com
Wed Jul 27 23:50:12 PDT 2022
On 7/27/2022 5:59 PM, Alexandru Elisei wrote:
> Hi,
>
> On Wed, Jun 08, 2022 at 09:16:56AM +0800, Huang, Shaoqin wrote:
>>
>> On 6/1/2022 12:45 AM, Will Deacon wrote:
>>> On Fri, May 20, 2022 at 05:03:29PM +0100, Alexandru Elisei wrote:
>>>> On Thu, May 19, 2022 at 02:41:08PM +0100, Will Deacon wrote:
>>>>> Now that EL2 is able to manage guest stage-2 page-tables, avoid
>>>>> allocating a separate MMU structure in the host and instead introduce a
>>>>> new fault handler which responds to guest stage-2 faults by sharing
>>>>> GUP-pinned pages with the guest via a hypercall. These pages are
>>>>> recovered (and unpinned) on guest teardown via the page reclaim
>>>>> hypercall.
>>>>>
>>>>> Signed-off-by: Will Deacon <will at kernel.org>
>>>>> ---
>>>> [..]
>>>>> +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>> +			  unsigned long hva)
>>>>> +{
>>>>> +	struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
>>>>> +	struct mm_struct *mm = current->mm;
>>>>> +	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
>>>>> +	struct kvm_pinned_page *ppage;
>>>>> +	struct kvm *kvm = vcpu->kvm;
>>>>> +	struct page *page;
>>>>> +	u64 pfn;
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = topup_hyp_memcache(hyp_memcache, kvm_mmu_cache_min_pages(kvm));
>>>>> +	if (ret)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	ppage = kmalloc(sizeof(*ppage), GFP_KERNEL_ACCOUNT);
>>>>> +	if (!ppage)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	ret = account_locked_vm(mm, 1, true);
>>>>> +	if (ret)
>>>>> +		goto free_ppage;
>>>>> +
>>>>> +	mmap_read_lock(mm);
>>>>> +	ret = pin_user_pages(hva, 1, flags, &page, NULL);
>>>>
>>>> When I implemented memory pinning via GUP for the KVM SPE series, I
>>>> discovered that the pages were regularly unmapped at stage 2 because of
>>>> automatic numa balancing, as change_prot_numa() ends up calling
>>>> mmu_notifier_invalidate_range_start().
>>>>
>>>> I was curious how you managed to avoid that; I don't know my way around
>>>> pKVM and can't seem to find where that's implemented.
>>>
>>> With this series, we don't take any notice of the MMU notifiers at EL2
>>> so the stage-2 remains intact. The GUP pin will prevent the page from
>>> being migrated as the rmap walker won't be able to drop the mapcount.
>>>
>>> It's functional, but we'd definitely like to do better in the long term.
>>> The fd-based approach that I mentioned in the cover letter gets us some of
>>> the way there for protected guests ("private memory"), but non-protected
>>> guests running under pKVM are proving to be pretty challenging (we need to
>>> deal with things like sharing the zero page...).
>>>
>>> Will
>>
>> My understanding is that with pin_user_pages(), the pages used by guests
>> (both protected and non-protected) stay in memory for a long time and will
>> not be swapped or migrated, so there is no need to care about the MMU
>> notifiers. Is that right?
>
> There are two things here.
>
> First, pinning a page means making the data persistent in memory. From
> Documentation/core-api/pin_user_pages.rst:
>
> "FOLL_PIN is a *replacement* for FOLL_GET, and is for short term pins on
> pages whose data *will* get accessed. As such, FOLL_PIN is a "more severe"
> form of pinning. And finally, FOLL_LONGTERM is an even more restrictive
> case that has FOLL_PIN as a prerequisite: this is for pages that will be
> pinned longterm, and whose data will be accessed."
>
> It does not mean that the translation table entry for the page is not
> modified for as long as the pin exists. In the example I gave, automatic
> NUMA balancing changes the protection of translation table entries to
> PAGE_NONE, which will invoke the MMU notifiers to unmap the corresponding
> stage 2 entries, regardless of the fact that the pinned pages will not get
> migrated the next time they are accessed.
>
> There are other mechanisms in the kernel that do that, for example
> split_huge_pmd(), which must always succeed, even if the THP is pinned (it
> transfers the refcounts among the pages): "Note that split_huge_pmd()
> doesn't have any limitations on refcounting: pmd can be split at any point
> and never fails" (Documentation/vm/transhuge.rst, also see
> __split_huge_pmd() from mm/huge_memory.c).
>
> KSM also does that: it invokes the invalidate_range_start MMU notifier
> before backing out of the merge because of the refcount (see mm/ksm.c::
> try_to_merge_one_page -> write_protect_page).
>
> This brings me to my second point: one might rightfully ask themselves (I
> did!), why not invoke the MMU notifiers *after* checking that the page is
> not pinned? It turns out that that is not reliable, because the refcount is
> increased by GUP with the page table lock held (which is a spinlock), but by
> their design the invalidate_range_start MMU notifiers must be called from
> interruptible + preemptible context. The only way to avoid races would be
> to call the MMU notifier while holding the page table lock, which is
> impossible.
>
> Hope my explanation has been adequate.
>
> Thanks,
> Alex
Thanks for your clear explanation.
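
A minimal user-space sketch of the distinction discussed above, in case it
helps other readers: a pin stops the page from being migrated, but it does
not stop the MMU notifier from firing when the host PTE is changed. This is
a toy model only. Every type and function here (toy_page, toy_guest,
toy_numa_scan() and so on) is hypothetical and is not a kernel or KVM API;
the "honours_notifiers" flag is an assumption standing in for whether the
hypervisor unmaps stage 2 in response to invalidate_range_start.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: none of these names exist in the kernel. */
struct toy_page {
	int pin_count;	/* stands in for the reference taken by a GUP pin */
	int node;	/* NUMA node currently backing the page */
};

struct toy_guest {
	struct toy_page *page;
	bool host_pte_accessible;	/* false models a PAGE_NONE host PTE */
	bool stage2_mapped;		/* guest stage-2 entry is present */
	bool honours_notifiers;		/* does the hypervisor act on notifiers? */
};

static void toy_pin(struct toy_page *p)
{
	p->pin_count++;
}

/* Models invalidate_range_start: stage 2 is unmapped only if we listen. */
static void toy_invalidate_range_start(struct toy_guest *g)
{
	if (g->honours_notifiers)
		g->stage2_mapped = false;
}

/*
 * Models a NUMA balancing scan: the notifier fires and the host PTE is
 * made inaccessible whether or not the page is pinned.
 */
static void toy_numa_scan(struct toy_guest *g)
{
	toy_invalidate_range_start(g);
	g->host_pte_accessible = false;
}

/*
 * Models the later hinting fault: migration is refused for a pinned page
 * and the host PTE is restored. Returns true if the page moved.
 */
static bool toy_numa_fault(struct toy_guest *g, int target_node)
{
	bool migrated = false;

	if (g->page->pin_count == 0) {
		g->page->node = target_node;
		migrated = true;
	}
	g->host_pte_accessible = true;
	return migrated;
}

/* Pin, scan, fault; report whether the stage-2 mapping survived. */
bool toy_stage2_survives_numa_scan(bool honours_notifiers)
{
	struct toy_page page = { .pin_count = 0, .node = 0 };
	struct toy_guest g = {
		.page = &page,
		.host_pte_accessible = true,
		.stage2_mapped = true,
		.honours_notifiers = honours_notifiers,
	};
	bool migrated;

	toy_pin(&page);
	toy_numa_scan(&g);
	migrated = toy_numa_fault(&g, 1);

	/* The pin kept the page where it was in both configurations. */
	assert(!migrated && page.node == 0);
	return g.stage2_mapped;
}
```

Calling toy_stage2_survives_numa_scan(true) models the case Alex described
for the SPE series: the page never moves, yet the stage-2 entry is gone.
Calling it with false models what Will described for this series, where EL2
takes no notice of the notifiers and the stage-2 mapping stays intact.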