[RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd

Wed Jul 17 18:59:02 PDT 2024

On Thu, Jul 11, 2024 at 4:37 PM David Matlack <dmatlack at google.com> wrote:
>
> On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton at google.com> wrote:
> >
> > --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> >
> > The most straightforward way to inform KVM of userfault-enabled pages is
> > to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> >
> > There is already infrastructure in place for modifying and checking
> > memory attributes. Using this interface is slightly challenging, as there
> > is no UAPI for setting/clearing particular attributes; we must set the
> > exact attributes we want.
>
> The thing we'll want to optimize specifically is clearing
> ATTRIBUTE_USERFAULT. During post-copy migration, there will be
> potentially hundreds of vCPUs in a single VM concurrently
> demand-fetching memory. Clearing ATTRIBUTE_USERFAULT for each page
> fetched is on the critical path of getting the vCPU back into
> guest-mode.
>
> Clearing ATTRIBUTE_USERFAULT just needs to clear the attribute. It
> doesn't need to modify page tables or update any data structures other
> than the attribute itself. But the existing UAPI takes both mmu_lock
> and slots_lock IIRC.
>
> I'm also concerned that the existing UAPI could lead to userspace
> accidentally clearing ATTRIBUTE_USERFAULT when it goes to set
> ATTRIBUTE_PRIVATE (or any other potential future attribute). Sure that
> could be solved but that means centrally tracking attributes in
> userspace and issuing one ioctl per contiguous region of guest memory
> with matching attributes. Imagine a scenario where ~every other page
> of guest memory as ATTRIBUTE_USERFAULT and then userspace wants to set
> a differient attribute on a large region of memory. That's going to
> take a _lot_ of ioctls.
>
> Having a UAPI to set (attributes |= delta) and clear (attributes &=
> ~delta) attributes on a range of GFNs would solve both these problems.

Hi David, sorry for the delay getting back to you.

I agree with all of these points.

>
> >
> > The synchronization that is in place for updating memory attributes is
> > not suitable for post-copy live migration either, which will require
> > updating memory attributes (from userfault to no-userfault) very
> > frequently.
>
> There is also the xarray. I imagine that will trigger a lot of dynamic
> memory allocations during post-copy which will slow increase the total
> time a vCPU is paused due to a USERFAULT page.
>
> Is it feasible to convert attributes to a bitmap?

I don't see any reason why we couldn't convert attributes to be a
bitmap (or to have some attributes be stored in bitmaps and others be
stored in the xarray).

>
> >
> > Another potential interface could be to use something akin to a dirty
> > bitmap, where a bitmap describes which pages within a memslot (or VM)
> > should trigger userfaults. This way, it is straightforward to make
> > updates to the userfault status of a page cheap.
>
> Taking a similar approach to dirty logging is attractive for several reasons.
>
> 1. The infrastructure to manage per-memslot bitmaps already exists for
> dirty logging.
> 2. It avoids the performance problems with xarrays by using a bitmap.
> 3. It avoids the performance problems with setting all attributes at once.
>
> However it will require new specific UAPIs to set/clear. And it's
> probably possible to optimize attributes to meet our needs, and those
> changes will benefit all attributes.

Ok so the three options in my head are:
1. Add an attribute diff UAPI and track the USERFAULT attribute in the xarray.
2. Add an attribute diff UAPI and track the USERFAULT attribute with a bitmap.
3. Add a new UAPI to enable KVM userfaults on gfns according to a
particular bitmap, similar to dirty logging.

(1) is problematic because it is valid to have every page (or, say,
every other page) have ATTRIBUTE_USERFAULT.

(2) seems ok to me.

(3) would be great, but maybe the much more complicated UAPI is not
worth it. (We get the ability to mark many different regions as
USERFAULT in one syscall, and KVM has a lot of code for handling
bitmap arguments.)

I'm hoping others will weigh in here.

> >
> > When KVM Userfault is enabled, we need to be careful not to map a
> > userfault page in response to a fault on a non-userfault page. In this
> > RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.
> >
> > --- Page fault notifications ---
> >
> > For page faults generated by vCPUs running in guest mode, if the page
> > the vCPU is trying to access is a userfault-enabled page, we use
> > KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.
> >
> > For arm64, I believe this is actually all we need, provided we handle
> > steal_time properly.
>
> There's steal time, and also the GIC pages. Steal time can use
> KVM_EXIT_MEMORY_FAULT, but that requires special casing in the ARM
> code. Alternatively, both can use the async mechanism and to avoid
> special handling in the ARM code.

Oh, of course, I forgot about the GIC. Thanks. And yes, if the async
userfault mechanism is acceptable, using that would be better than
adding the special cases.

>
> >
> > For x86, where returning from deep within the instruction emulator (or
> > other non-trivial execution paths) is infeasible, being able to pause
> > execution while userspace fetches the page, just as userfaultfd would
> > do, is necessary. Let's call these "asynchronous userfaults."
> >
> > A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
> > userfaults, and an eventfd is used to signal that new faults are
> > available for reading.
> >
> > Today, we busy-wait for a gfn to have userfault disabled. This will
> > change in the future.
> >
> > --- Fault resolution ---
> >
> > Resolving userfaults today is as simple as removing the USERFAULT memory
> > attribute on the faulting gfn. This will change if we do not end up
> > using memory attributes for KVM Userfault. Having a range-based wake-up
> > like userfaultfd (see UFFDIO_WAKE) might also be helpful for
> > performance.
> >
> > Problems with this series
> > =========================
> > - This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
> > - Memory attribute modification doesn't scale well.
> > - We busy-wait for pages to not be userfault-enabled.
>
> Async faults are a slow path so I think a wait queue would suffice.

I think a wait queue seems like a good fit too. (It's what userfaultfd uses.)

>
> > - gfn_to_hva and gfn_to_pfn caches are not invalidated.
> > - Page tables are not collapsed when KVM Userfault is disabled.
> > - There is no self-test for asynchronous userfaults.
> > - Asynchronous page faults can be dropped if KVM_READ_USERFAULT fails.
>
> Userspace would probably treat this as fatal anyway right?

Yes, but I still think dropping the gfn isn't great. I'll fix this
when I change from using the hacky list-based thing to something more
sophisticated (like a wait_queue).