[RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd

Wed Jul 17 18:09:21 PDT 2024

On Wed, Jul 17, 2024 at 8:03 AM Wang, Wei W <wei.w.wang at intel.com> wrote:
>
> On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> > You're right that, today, including support for guest-private memory
> > *only* indeed simplifies things (no async userfaults). I think your strategy for
> > implementing post-copy would work (so, shared->private conversion faults for
> > vCPU accesses to private memory, and userfaultfd for everything else).
>
> Yes, it works and has been used for our internal tests.
>
> >
> > I'm not 100% sure what should happen in the case of a non-vCPU access to
> > should-be-private memory; today it seems like KVM just provides the shared
> > version of the page, so conventional use of userfaultfd shouldn't break
> > anything.
>
> This seems to be the trusted IO usage (not aware of other usages, emulated device
> backends, such as vhost, work with shared pages). Migration support for trusted device
> passthrough doesn't seem to be architecturally ready yet. Especially for postcopy,
> AFAIK, even the legacy VM case lacks the support for device passthrough (not sure if
> you've made it internally). So it seems too early to discuss this in detail.

We don't migrate VMs with passthrough devices.

I still think the way KVM handles non-vCPU accesses to private memory
is wrong: surely it is an error, yet we simply provide the shared
version of the page. *shrug*

>
> >
> > But eventually guest_memfd itself will support "shared" memory,
>
> OK, I thought of this. Not sure how feasible it would be to extend gmem for
> shared memory. I think questions like below need to be investigated:

An RFC for it got posted recently[1]. :)

> #1 what are the tangible benefits of gmem based shared memory, compared to the
>      legacy shared memory that we have now?

For [1], unmapping guest memory from the direct map.

> #2 There would be some gaps to make gmem usable for shared pages. For
>       example, would it support userspace to map (without security concerns)?

At least in [1], userspace would be able to mmap it, but KVM would
still not be able to GUP it (instead going through the normal
guest_memfd path).

> #3 if gmem gets extended to be something like hugetlb (e.g. 1GB), would it result
>      in the same issue as hugetlb?

Good question. At the end of the day, the problem is that GUP relies
on host mm page table mappings, and HugeTLB can't map things with
PAGE_SIZE PTEs.

At least as of [1], given that KVM doesn't GUP guest_memfd memory, we
don't rely on the host mm page table layout, so we don't have the same
problem.

For VMMs that want to catch userspace (or non-GUP kernel) accesses via
a guest_memfd VMA, then it's possible it has the same issue. But for
VMMs that don't care to catch these kinds of accesses (the kind of
user that would use KVM Userfault to implement post-copy), it doesn't
matter.

[1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/

>
> The support of using gmem for shared memory isn't in place yet, and this seems
> to be a dependency for the support being added here.

Perhaps I've been slightly preemptive. :) I still think there's useful
discussion here.

> > and
> > (IIUC) it won't use VMAs, so userfaultfd won't be usable (without changes
> > anyway). For a non-confidential VM, all memory will be "shared", so shared-
> > >private conversions can't help us there either.
> > Starting everything as private almost works (so using private->shared
> > conversions as a notification mechanism), but if the first time KVM attempts to
> > use a page is not from a vCPU (and is from a place where we cannot easily
> > return to userspace), the need for "async userfaults"
> > comes back.
>
> Yeah, this needs to be resolved for KVM userfaults. If gmem is used for private
> pages only, this wouldn't be an issue (it will be covered by userfaultfd).

We're on the same page here.

>
>
> >
> > For this use case, it seems cleaner to have a new interface. (And, as far as I can
> > tell, we would at least need some kind of "async userfault"-like mechanism.)
> >
> > Another reason why, today, KVM Userfault is helpful is that userfaultfd has a
> > couple drawbacks. Userfaultfd migration with HugeTLB-1G is basically
> > unusable, as HugeTLB pages cannot be mapped at PAGE_SIZE. Some discussion
> > here[1][2].
> >
> > Moving the implementation of post-copy to KVM means that, throughout
> > post-copy, we can avoid changes to the main mm page tables, and we only
> > need to modify the second stage page tables. This saves the memory needed
> > to store the extra set of shattered page tables, and we save the performance
> > overhead of the page table modifications and accounting that mm does.
>
> It would be nice to see some data for comparisons between kvm faults and userfaultfd
> e.g., end to end latency of handling a page fault via getting data from the source.
> (I didn't find data from the link you shared. Please correct me if I missed it)

I don't have an A/B comparison for kernel end-to-end fault latency. :(
But I can tell you that with 32us or so network latency, it's not a
huge difference (assuming Anish's series[2]).

The real performance issue comes when we are collapsing the page
tables at the end. We basically have to do ~2x of everything (TLB
flushes, etc.), plus additional accounting that HugeTLB/THP does
(adjusting refcount/mapcount), etc. And one must optimize how the
unmap MMU notifiers are called so as to not stall vCPUs unnecessarily.

[2]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/

>
>
> > We don't necessarily need a way to go from no-fault -> fault for a page, that's
> > right[4]. But we do need a way for KVM to be able to allow the access to
> > proceed (i.e., go from fault -> no-fault). IOW, if we get a fault and come out to
> > userspace, we need a way to tell KVM not to do that again.
> > In the case of shared->private conversions, that mechanism is toggling the memory
> > attributes for a gfn.  For conventional userfaultfd, that's using
> > UFFDIO_COPY/CONTINUE/POISON.
> > Maybe I'm misunderstanding your question.
>
> We can come back to this after the dependency discussion above is done. (If gmem is only
> used for private pages, the support for postcopy, including changes required for VMMs, would
> be simpler)