[RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT

Fri Jul 26 11:00:40 PDT 2024

On Fri, Jul 26, 2024 at 9:50 AM Nikita Kalyazin <kalyazin at amazon.com> wrote:
>
> Hi James,
>
> On 11/07/2024 00:42, James Houghton wrote:
> > It is possible that KVM wants to access a userfault-enabled GFN in a
> > path where it is difficult to return out to userspace with the fault
> > information. For these cases, add a mechanism for KVM to wait for a GFN
> > to not be userfault-enabled.
> In this patch series, an asynchronous notification mechanism is used
> only in cases "where it is difficult to return out to userspace with the
> fault information". However, we (AWS) have a use case where we would
> like to be notified asynchronously about _all_ faults. Firecracker can
> restore a VM from a memory snapshot where the guest memory is supplied
> via a Userfaultfd by a process separate from the VMM itself [1]. While
> it looks technically possible for the VMM process to handle exits via
> forwarding the faults to the other process, that would require building
> a complex userspace protocol on top and likely introduce extra latency
> on the critical path.
> This also implies that a KVM API
> (KVM_READ_USERFAULT) is not suitable, because KVM checks that the ioctls
> are performed specifically by the VMM process [2]:
>         if (kvm->mm != current->mm || kvm->vm_dead)
>                 return -EIO;

If it would be useful, we could absolutely have a flag to have all
faults go through the asynchronous mechanism. :) It's meant to just be
an optimization. For me, it is a necessary optimization.

Userfaultfd doesn't scale particularly well: we have to grab two locks
to work with the wait_queues. You could create several userfaultfds,
but the underlying issue is still there. KVM Userfault, if it uses a
wait_queue for the async fault mechanism, will have the same
bottleneck. Anish and I worked on making userfaults more scalable for
KVM[1], and we ended up with a scheme very similar to what we have in
this KVM Userfault series.

My use case already requires using a reasonably complex API for
interacting with a separate userland process for fetching memory, and
it's really fast. I've never tried to hook userfaultfd into this other
process, but I'm quite certain that [1] + this process's interface
scale better than userfaultfd does. Perhaps userfaultfd, for
not-so-scaled-up cases, could be *slightly* faster, but I mostly care
about what happens when we scale to hundreds of vCPUs.

[1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/

>
>  > The implementation of this mechanism is certain to change before KVM
>  > Userfault could possibly be merged.
> How do you envision resolving faults in userspace? Copying the page in
> (provided that userspace mapping of guest_memfd is supported [3]) and
> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
> sufficient to resolve the fault because an attempt to copy the page
> directly in userspace will trigger a fault on its own

This is not true for KVM Userfault, at least for right now. Userspace
accesses to guest memory will not trigger KVM Userfaults. (I know this
name is terrible -- regular old userfaultfd() userfaults will indeed
get triggered, provided you've set things up properly.)

KVM Userfault is merely meant to catch KVM's own accesses to guest
memory (including vCPU accesses). For non-guest_memfd memslots,
userspace can totally just write through the VMA it has made (KVM
Userfault *cannot*, by virtue of being completely divorced from mm,
intercept this access). For guest_memfd, userspace could write to
guest memory through a VMA if that's where guest_memfd is headed, but
perhaps it will rely on exact details of how userspace is meant to
populate guest_memfd memory.

You're totally right that, in essence, we will need some kind of
non-faulting way to interact with guest memory. With traditional
memslots and VMAs, we have that already; guest_memfd memslots and
VMAs, I think we will have that eventually.

> and may lead to a
> deadlock in the case where the original fault was caused by the VMM. An
> interface similar to UFFDIO_COPY is needed that would allocate a page,
> copy the content in and update page tables.

In case it's interesting or useful at all, we actually use
UFFDIO_CONTINUE for our live migration use case. We mmap() memory
twice -- one of them we register with userfaultfd and also give to
KVM. The other one we use to install memory -- our non-faulting view
of guest memory!

>
> [1] Firecracker snapshot restore via UserfaultFD:
> https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
> [2] KVM ioctl check for the address space:
> https://elixir.bootlin.com/linux/v6.10.1/source/virt/kvm/kvm_main.c#L5083
> [3] mmap() of guest_memfd:
> https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
>
> Thanks,
> Nikita

Thanks for the feedback!