[RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd

Mon Jul 15 08:25:36 PDT 2024

On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> This patch series implements the KVM-based demand paging system that was
> first introduced back in November[1] by David Matlack.
> 
> The working name for this new system is KVM Userfault, but that name is very
> confusing so it will not be the final name.
> 
Hi James,
I had implemented a similar approach for TDX post-copy migration, there are quite
some differences though. Got some questions about your design below.

> Problem: post-copy with guest_memfd
> ===================================
> 
> Post-copy live migration makes it possible to migrate VMs from one host to
> another no matter how fast they are writing to memory while keeping the VM
> paused for a minimal amount of time. For post-copy to work, we
> need:
>  1. to be able to prevent KVM from being able to access particular pages
>     of guest memory until we have populated it  2. for userspace to know when
> KVM is trying to access a particular
>     page.
>  3. a way to allow the access to proceed.
> 
> Traditionally, post-copy live migration is implemented using userfaultfd, which
> hooks into the main mm fault path. KVM hits this path when it is doing HVA ->
> PFN translations (with GUP) or when it itself attempts to access guest memory.
> Userfaultfd sends a page fault notification to userspace, and KVM goes to sleep.
> 
> Userfaultfd works well, as it is not specific to KVM; everyone who attempts to
> access guest memory will block the same way.
> 
> However, with guest_memfd, we do not use GUP to translate from GFN to HPA
> (nor is there an intermediate HVA).
> 
> So userfaultfd in its current form cannot be used to support post-copy live
> migration with guest_memfd-backed VMs.
> 
> Solution: hook into the gfn -> pfn translation
> ==============================================
> 
> The only way to implement post-copy with a non-KVM-specific userfaultfd-like
> system would be to introduce the concept of a file-userfault[2] to intercept
> faults on a guest_memfd.
> 
> Instead, we take the simpler approach of adding a KVM-specific API, and we
> hook into the GFN -> HVA or GFN -> PFN translation steps (for traditional
> memslots and for guest_memfd respectively).


Why taking KVM_EXIT_MEMORY_FAULT faults for the traditional shared
pages (i.e. GFN -> HVA)? 
It seems simpler if we use KVM_EXIT_MEMORY_FAULT for private pages only, leaving
shared pages to go through the existing userfaultfd mechanism:
- The need for “asynchronous userfaults,” introduced by patch 14, could be eliminated.
- The additional support (e.g., KVM_MEMORY_EXIT_FLAG_USERFAULT) for private page
  faults exiting to userspace for postcopy might not be necessary, because all pages on the
  destination side are initially “shared,” and the guest’s first access will always cause an
  exit to userspace for shared->private conversion. So VMM is able to leverage the exit to
  fetch the page data from the source (VMM can know if a page data has been fetched
  from the source or not).

> 
> I have intentionally added support for traditional memslots, as the complexity
> that it adds is minimal, and it is useful for some VMMs, as it can be used to
> fully implement post-copy live migration.
> 
> Implementation Details
> ======================
> 
> Let's break down how KVM implements each of the three core requirements
> for implementing post-copy as laid out above:
> 
> --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> 
> The most straightforward way to inform KVM of userfault-enabled pages is to
> use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> 
> There is already infrastructure in place for modifying and checking memory
> attributes. Using this interface is slightly challenging, as there is no UAPI for
> setting/clearing particular attributes; we must set the exact attributes we want.
> 
> The synchronization that is in place for updating memory attributes is not
> suitable for post-copy live migration either, which will require updating
> memory attributes (from userfault to no-userfault) very frequently.
> 
> Another potential interface could be to use something akin to a dirty bitmap,
> where a bitmap describes which pages within a memslot (or VM) should trigger
> userfaults. This way, it is straightforward to make updates to the userfault
> status of a page cheap.
> 
> When KVM Userfault is enabled, we need to be careful not to map a userfault
> page in response to a fault on a non-userfault page. In this RFC, I've taken the
> simplest approach: force new PTEs to be PAGE_SIZE.
> 
> --- Page fault notifications ---
> 
> For page faults generated by vCPUs running in guest mode, if the page the
> vCPU is trying to access is a userfault-enabled page, we use

Why is it necessary to add the per-page control (with uAPIs for VMM to set/clear)?
Any functional issues if we just have all the page faults exit to userspace during the
post-copy period?
- As also mentioned above, userspace can easily know if a page needs to be
  fetched from the source or not, so upon a fault exit to userspace, VMM can
  decide to block the faulting vcpu thread or return back to KVM immediately.
- If improvement is really needed (would need profiling first) to reduce number
  of exits to userspace, a  KVM internal status (bitmap or xarray) seems sufficient.
  Each page only needs to exit to userspace once for the purpose of fetching its data
  from the source in postcopy. It doesn't seem to need userspace to enable the exit
  again for the page (via a new uAPI), right?