[RFC PATCH v3 0/6] Direct Map Removal for guest_memfd
David Hildenbrand
david at redhat.com
Mon Nov 4 13:30:53 PST 2024
>> We talked about shared (faultable) vs. private (unfaultable), and how it
>> would interact with the directmap patches here.
>>
>> As discussed, having private (unfaultable) memory with the direct-map
>> removed and shared (faultable) memory with the direct-mapping can make
>> sense for non-CoCo (non-TDX/AMD-SEV/...) use cases. Not sure about CoCo;
>> the discussion here seems to indicate that it might currently not be
>> required.
>>
>> So one thing we could do is that shared (faultable) will have a direct
>> mapping and be gup-able and private (unfaultable) memory will not have a
>> direct mapping and is, by design, not gup-able.
>> Maybe it could make sense to not have a direct map for all guest_memfd
>> memory, making it behave like secretmem (and it would be easy to
>> implement)? But I'm not sure if that is really desirable in VM context.
>
> This would work for us (in this scenario, the swiotlb areas would be
> "traditional" memory, e.g. set to shared via mem attributes instead of
> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
> series (well, we'd need to figure out how to get the mappings of gmem
> back into KVM, since in this setup, short-circuiting it into
> userspace_addr wouldn't work, unless we banish swiotlb into a different
> memslot altogether somehow).
Right.
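
For completeness, "set to shared via mem attributes" would be roughly
the following from userspace (a sketch only; vm_fd/gpa/size are
assumed):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Mark the swiotlb range shared by clearing the PRIVATE attribute. */
static int set_swiotlb_shared(int vm_fd, __u64 gpa, __u64 size)
{
        struct kvm_memory_attributes attrs = {
                .address = gpa,
                .size = size,
                .attributes = 0, /* no KVM_MEMORY_ATTRIBUTE_PRIVATE -> shared */
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}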
> But I don't think it'd work for pKVM, iirc
> they need GUP on gmem, and also want direct map removal (... but maybe,
> the gmem VMA for the non-CoCo use case and the gmem VMA for pKVM could
> behave differently? non-CoCo gets essentially memfd_secret, pKVM gets
> GUP+no faults of private mem).
Good question. So far my perception was that the directmap removal on
"private/unfaultable" would be sufficient.
>
>> Having a mixture of "has directmap" and "has no directmap" for shared
>> (faultable) memory should not be done. Similarly, private memory really
>> should stay "unfaultable".
>
> You've convinced me that having both GUP-able and non GUP-able
> memory in the same VMA will be tricky. However, I'm less convinced on
> why private memory should stay unfaultable; only that it shouldn't be
> faultable into a VMA that also allows GUP. Can we have two VMAs? One
> that disallows GUP, but allows userspace access to shared and private,
> and one that allows GUP, but disallows accessing private memory? Maybe
> via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
> different spin of the above idea.
What we are trying to achieve is making guest_memfd not behave
completely differently on that level for different "types" of VMs. So
one of the goals should be to try to unify it as much as possible.
shared -> faultable: GUP-able
private -> unfaultable: unGUP-able
And it makes sense, because a lot of future work will rely on some
important properties: for example, if private memory cannot be faulted
in + GUPed, core-MM will never have obtained valid references to such a
page. There is no need to split large folios into smaller ones for
tracking purposes; there is no need to maintain per-page refcounts and
pincounts ...
It doesn't mean that we cannot consider it if really required, but there
really has to be a strong case for it, because it will all get really messy.
For example, one issue is that a folio only has a single mapping
(folio->mapping), and that is used in the GUP-fast path (no VMA) to
determine whether GUP-fast is allowed or not.
So you'd have to force everything through GUP-slow, where you could
consider VMA properties :( It sounds quite suboptimal.
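
To illustrate, the GUP-fast check looks roughly like this (a sketch,
loosely modeled on gup_fast_folio_allowed() in mm/gup.c, simplified;
the function name here is made up):

/*
 * Sketch, loosely modeled on mm/gup.c: GUP-fast runs without the VMA,
 * so the only per-page policy hook is the folio itself.
 */
static bool gup_fast_allowed(struct folio *folio)
{
        struct address_space *mapping = READ_ONCE(folio->mapping);

        if (folio_test_anon(folio))
                return true;
        /*
         * A single mapping pointer per folio: we can reject a whole
         * mapping type (e.g., secretmem), but we cannot express
         * "GUP-able via VMA A but not via VMA B" for the same folio.
         */
        if (!mapping)
                return false;
        return !secretmem_mapping(mapping);
}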
I don't think multiple VMAs are what we really want. See below.
>
>> I think one of the points raised during the bi-weekly call was that
>> using a viommu/swiotlb might be the right call, such that all memory can
>> be considered private (unfaultable) that is not explicitly
>> shared/expected to be modified by the hypervisor (-> faultable, ->
>> GUP-able).
>>
>> Further, I think Sean had some good points why we should explore that
>> direction, but I recall that there were some issues to be sorted out
>> (interpreted instructions requiring direct map when accessing "private"
>> memory?), not sure if that is already working/can be made working in KVM.
>
> Yeah, the big one is MMIO instruction emulation on x86, which does guest
> page table walks and instruction fetch (and particularly the latter
> cannot be known ahead-of-time by the guest, aka cannot be explicitly
> "shared"). That's what the majority of my v2 series was about. For
> traditional memslots, KVM handles these via get_user and friends, but if
> we don't have a VMA that allows faulting all of gmem, then that's
> impossible, and we're in "temporarily restore direct map" land. Which
> comes with significant performance penalties due to TLB flushes.
Agreed.
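
For context, that memslot path boils down to roughly the following
(simplified from virt/kvm/kvm_main.c); without a faultable mapping
there is simply no hva to resolve:

/* Simplified from virt/kvm/kvm_main.c: KVM reads guest memory through
 * the userspace mapping (hva) of the memslot. */
static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
                                 void *data, int offset, int len)
{
        unsigned long addr = gfn_to_hva_memslot(slot, gfn);

        if (kvm_is_error_hva(addr))
                return -EFAULT;
        if (__copy_from_user(data, (void __user *)addr + offset, len))
                return -EFAULT;
        return 0;
}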
>> What's your opinion after the call and the next step for use cases
>> like you have in mind (IIRC firecracker, which wants to not have the
>> direct-map for guest memory where it can be avoided)?
>
> Yea, the usecase is for Firecracker to not have direct map entries for
> guest memory, unless needed for I/O (-> swiotlb).
>
> As for next steps, let's determine once and for all if we can do the
> KVM-internal guest memory accesses for MMIO emulation through userspace
> mappings (although if we can't I'll have some serious soul-searching to
> do, because all other solutions we talked about so far also have fairly
> big drawbacks; on-demand direct map reinsertion has terrible
> performance).
So IIUC, KVM would have to access "unfaultable" guest_memfd memory using
fd+offset, and that's problematic because "no-directmap".
So you'd have to map+unmap the directmap repeatedly, and still expose it
temporarily in the direct map to others. I see how that is undesirable,
even when trying to cache hotspots (partly destroying the purpose of the
directmap removal).
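
To make the cost concrete, the access path would have to look something
like this (a sketch using the existing set_direct_map_* helpers that
secretmem relies on; the wrapper itself is made up):

static int gmem_read_via_direct_map(struct page *page, void *dst,
                                    size_t len)
{
        unsigned long addr = (unsigned long)page_address(page);
        int r;

        /* Transiently re-insert the page into the direct map. */
        r = set_direct_map_default_noflush(page);
        if (r)
                return r;
        memcpy(dst, page_address(page), len);
        /* Remove it again; the removal only becomes effective after a
         * kernel TLB flush, which is the expensive part. */
        r = set_direct_map_invalid_noflush(page);
        flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        return r;
}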
Would a per-MM kernel mapping of these pages work, so KVM can access them?
It sounds a bit like what is required for clean per-MM allocations [1]:
establish a per-MM kernel mapping of (selected?) pages. Not necessarily
all of them.
Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything
involved with ordinary user mappings for these private/unfaultable
thingies. Just like what is discussed in, and similar to, [1].
Just throwing it out there, maybe we really want to avoid the directmap
(keep it unmapped) and maintain a per-mm mapping for a bunch of folios
that can be easily removed when required by guest_memfd (ftruncate,
conversion private->shared) on request.
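
Purely as a strawman, such an interface could look something like this
(all names hypothetical; nothing like it exists today):

/* Hypothetical per-MM kernel mapping interface: folios live in an
 * mm-local kernel VA range instead of the shared direct map. */
void *mm_local_map(struct mm_struct *mm, struct folio *folio);
void mm_local_unmap(struct mm_struct *mm, struct folio *folio);

/* KVM-internal accesses would then be bound to the VM's own MM: */
static int kvm_gmem_read(struct kvm *kvm, struct folio *folio,
                         loff_t offset, void *data, int len)
{
        void *va = mm_local_map(kvm->mm, folio);

        if (!va)
                return -EFAULT;
        memcpy(data, va + offset, len);
        mm_local_unmap(kvm->mm, folio);
        return 0;
}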
[1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u
--
Cheers,
David / dhildenb