[PATCH v2 00/13] KVM: Introduce KVM Userfault

Thu May 29 08:28:26 PDT 2025

On Wed, May 28, 2025, James Houghton wrote:
> The only thing that I want to call out again is that this UAPI works
> great for when we are going from userfault --> !userfault. That is, it
> works well for postcopy (both for guest_memfd and for standard
> memslots where userfaultfd scalability is a concern).
> 
> But there is another use case worth bringing up: unmapping pages that
> the VMM is emulating as poisoned.
> 
> Normally this can be handled by mm (e.g. with UFFDIO_POISON), but for
> 4K poison within a HugeTLB-backed memslot (if the HugeTLB page remains
> mapped in userspace), KVM Userfault is the only option (if we don't
> want to punch holes in memslots). This leaves us with three problems:
> 
> 1. If using KVM Userfault to emulate poison, we are stuck with small
> pages in stage 2 for the entire memslot.
> 2. We must unmap everything when toggling on KVM Userfault just to
> unmap a single page.
> 3. If KVM Userfault is already enabled, we have no choice but to
> toggle KVM Userfault off and on again to unmap the newly poisoned
> pages (i.e., there is no ioctl to scan the bitmap and unmap
> newly-userfault pages).
> 
> All of these are non-issues if we emulate poison by removing memslots,
> and I think that's possible. But if that proves too slow, we'd need to
> be a little bit more clever with hugepage recovery and with unmapping
> newly-userfault pages, both of which I think can be solved by adding
> some kind of bitmap re-scan ioctl. We can do that later if the need
> arises.

Hmm.

On the one hand, punching a hole in a memslot is generally gross, e.g. requires
deleting the entire memslot and thus unmapping large swaths of guest memory (or
all of guest memory for most x86 VMs).

On the other hand, unless userspace sets KVM_MEM_USERFAULT from time zero, KVM
will need to unmap guest memory (or demote the mapping size a la eager page
splitting?) when KVM_MEM_USERFAULT is toggled from 0=>1.

One thought would be to change the behavior of KVM's processing of the userfault
bitmap, such that KVM doesn't infer *anything* about the mapping sizes, and instead
give userspace more explicit control over the mapping size.  However, on non-x86
architectures, implementing such a control would require a non-trivial amount of
code and complexity, and would incur overhead that doesn't exist today (i.e. we'd
need to implement equivalent infrastructure to x86's disallow_lpage tracking).

And IIUC, another problem with KVM Userfault is that it wouldn't Just Work for
KVM accesses to guest memory.  E.g. if the HugeTLB page is still mapped into
userspace, then depending on the flow that gets hit, I'm pretty sure that emulating
an access to the poisoned memory would result in KVM_EXIT_INTERNAL_ERROR, whereas
punching a hole in a memslot would result in a much more friendly KVM_EXIT_MMIO.

All in all, given that KVM needs to correctly handle hugepage vs. memslot
alignment/size issues no matter what, and that KVM has well-established behavior
for handling no-memslot accesses, I'm leaning towards saying userspace should
punch a hole in the memslot in order to emulate a poisoned page.  The only reason
I can think of for preferring a different approach is if userspace can't provide
the desired latency/performance characteristics when punching a hole in a memslot.
Hopefully reacting to a poisoned page is a fairly slow path?