[RFC PATCH v3 0/6] Direct Map Removal for guest_memfd

Tue Nov 12 06:40:32 PST 2024

Hi David, 

sorry for the late response, I ended up catching the flu last week and
was out of commission for a while :(

On Mon, 2024-11-04 at 21:30 +0000, David Hildenbrand wrote:
>>> We talked about shared (faultable) vs. private (unfaultable), and how it
>>> would interact with the directmap patches here.
>>>
>>> As discussed, having private (unfaultable) memory with the direct-map
>>> removed and shared (faultable) memory with the direct-mapping can make
>>> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
>>> the discussion here seems to indicate that it might currently not be
>>> required.
>>>
>>> So one thing we could do is that shared (faultable) will have a direct
>>> mapping and be gup-able and private (unfaultable) memory will not have a
>>> direct mapping and is, by design, not gup-able.
>>>
>>> Maybe it could make sense to not have a direct map for all guest_memfd
>>> memory, making it behave like secretmem (and it would be easy to
>>> implement)? But I'm not sure if that is really desirable in VM context.
>>
>> This would work for us (in this scenario, the swiotlb areas would be
>> "traditional" memory, e.g. set to shared via mem attributes instead of
>> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
>> series (well, we'd need to figure out how to get the mappings of gmem
>> back into KVM, since in this setup, short-circuiting it into
>> userspace_addr wouldn't work, unless we banish swiotlb into a different
>> memslot altogether somehow).
> 
> Right.

"right" as in, "yes we could do that"? :p

>> But I don't think it'd work for pKVM, iirc
>> they need GUP on gmem, and also want direct map removal (... but maybe,
>> the gmem VMA for the non-CoCo usecase and the gmem VMA for pKVM could
>> behave differently?  non-CoCo gets essentially memfd_secret, pKVM gets
>> GUP+no faults of private mem).
> 
> Good question. So far my perception was that the directmap removal on
> "private/unfaultable" would be sufficient.
> 
>>
>>> Having a mixture of "has directmap" and "has no directmap" for shared
>>> (faultable) memory should not be done. Similarly, private memory really
>>> should stay "unfaultable".
>>
>> You've convinced me that having both GUP-able and non GUP-able
>> memory in the same VMA will be tricky. However, I'm less convinced on
>> why private memory should stay unfaultable; only that it shouldn't be
>> faultable into a VMA that also allows GUP. Can we have two VMAs? One
>> that disallows GUP, but allows userspace access to shared and private,
>> and one that allows GUP, but disallows accessing private memory? Maybe
>> via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
>> different spin of the above idea.
> 
> What we are trying to achieve is making guest_memfd not behave
> completely differently on that level for different "types" of VMs. So one
> of the goals should be to try to unify it as much as possible.
> 
> shared -> faultable: GUP-able
> private -> unfaultable: unGUP-able
> 
> 
> And it makes sense, because a lot of future work will rely on some
> important properties: for example, if private memory cannot be faulted
> in + GUPed, core-MM will never have obtained valid references to such a
> page. There is no need to split large folios into smaller ones for
> tracking purposes; there is no need to maintain per-page refcounts and
> pincounts ...
> 
> It doesn't mean that we cannot consider it if really required, but there
> really has to be a strong case for it, because it will all get really messy.
> 
> For example, one issue is that a folio only has a single mapping
> (folio->mapping), and that is used in the GUP-fast path (no VMA) to
> determine whether GUP-fast is allowed or not.
> 
> So you'd have to force everything through GUP-slow, where you could
> consider VMA properties :( It sounds quite suboptimal.
> 
> I don't think multiple VMAs are what we really want. See below.

Ah, okay, I see. Thanks for explaining, this all makes a lot of sense to
me now!
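
To check that I have the proposed rule straight, here is a toy userspace
model of it. This is only a sketch of my understanding: none of these
names exist in the kernel, they are all made up for illustration, and the
direct-map column follows the "shared has a direct mapping, private does
not" variant discussed above.

```c
#include <stdbool.h>

/* Hypothetical model of the proposal, NOT a kernel API. */
enum gmem_state { GMEM_SHARED, GMEM_PRIVATE };

struct gmem_props {
	bool faultable;   /* can be faulted into a userspace VMA */
	bool gup_able;    /* can be pinned through GUP */
	bool direct_map;  /* has an entry in the kernel direct map */
};

/*
 * shared  -> faultable, GUP-able, direct map present
 * private -> unfaultable, not GUP-able, direct map removed
 */
static struct gmem_props gmem_props_for(enum gmem_state state)
{
	bool shared = (state == GMEM_SHARED);
	struct gmem_props p = {
		.faultable  = shared,
		.gup_able   = shared,
		.direct_map = shared,
	};

	return p;
}
```

The point of the model is that all three properties move together, so
there is never a page that is faultable but not GUP-able within one
mapping.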

>>
>>> I think one of the points raised during the bi-weekly call was that
>>> using a viommu/swiotlb might be the right call, such that all memory can
>>> be considered private (unfaultable) that is not explicitly
>>> shared/expected to be modified by the hypervisor (-> faultable, ->
>>> GUP-able).
>>>
>>> Further, I think Sean had some good points why we should explore that
>>> direction, but I recall that there were some issues to be sorted out
>>> (interpreted instructions requiring direct map when accessing "private"
>>> memory?), not sure if that is already working/can be made working in KVM.
>>
>> Yeah, the big one is MMIO instruction emulation on x86, which does guest
>> page table walks and instruction fetch (and particularly the latter
>> cannot be known ahead-of-time by the guest, aka cannot be explicitly
>> "shared"). That's what the majority of my v2 series was about. For
>> traditional memslots, KVM handles these via get_user and friends, but if
>> we don't have a VMA that allows faulting all of gmem, then that's
>> impossible, and we're in "temporarily restore direct map" land. Which
>> comes with significant performance penalties due to TLB flushes.
> 
> Agreed.
> 
>>> What's your opinion after the call and the next step for use cases like
>>> you have in mind (IIRC firecracker, which wants to not have the
>>> direct-map for guest memory where it can be avoided)?
>>
>> Yea, the usecase is for Firecracker to not have direct map entries for
>> guest memory, unless needed for I/O (-> swiotlb).
>>
>> As for next steps, let's determine once and for all if we can do the
>> KVM-internal guest memory accesses for MMIO emulation through userspace
>> mappings (although if we can't I'll have some serious soul-searching to
>> do, because all other solutions we talked about so far also have fairly
>> big drawbacks; on-demand direct map reinsertion has terrible
>> performance
> So IIUC, KVM would have to access "unfaultable" guest_memfd memory using
> fd+offset, and that's problematic because "no-directmap".
> 
> So you'd have to map+unmap the directmap repeatedly, and still expose it
> temporarily in the direct map to others. I see how that is undesirable,
> even when trying to cache hotspots (partly destroying the purpose of the
> directmap removal).
> 
> 
> Would a per-MM kernel mapping of these pages work, so KVM can access them?
> 
> It sounds a bit like what is required for clean per-MM allocations [1]:
> establish a per-MM kernel mapping of (selected?) pages. Not necessarily
> all of them.
> 
> Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything
> involved with ordinary user mappings for these private/unfaultable
> thingies. Just as discussed in, and similar to, [1].
> 
> Just throwing it out there, maybe we really want to avoid the directmap
> (keep it unmapped) and maintain a per-mm mapping for a bunch of folios
> that can be easily removed when required by guest_memfd (ftruncate,
> conversion private->shared) on request.

I remember talking to someone at some point about whether we could reuse
the proc-local stuff for guest memory, but I cannot remember the outcome
of that discussion... (or maybe I just wanted to have a discussion about
it, but forgot to follow up on that thought?).  I guess we wouldn't use
proc-local _allocations_, but rather just set up proc-local mappings of
the gmem allocations that have been removed from the direct map.

I'm wondering, where exactly would be the differences to Sean's idea
about messing with the CR3 register inside KVM to temporarily install
page tables that contain all the gmem stuff, conceptually? Wouldn't we
run into the same interrupt problems that Sean foresaw for the CR3
stuff? (Admittedly, I still don't quite follow what these are :( )

(I've cc'd Fares Mehanna as well)

> [1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

Best,
Patrick