[RFC PATCH v3 0/6] Direct Map Removal for guest_memfd

Mon Nov 4 05:09:53 PST 2024

Hi David,

On 11/4/24 12:18, David Hildenbrand wrote:
> On 31.10.24 11:42, Patrick Roy wrote:
>> On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
>>> On 30.10.24 14:49, Patrick Roy wrote:
>>>> Unmapping virtual machine guest memory from the host kernel's direct map
>>>> is a successful mitigation against Spectre-style transient execution
>>>> issues: If the kernel page tables do not contain entries pointing to
>>>> guest memory, then any attempted speculative read through the direct map
>>>> will necessarily be blocked by the MMU before any observable
>>>> microarchitectural side-effects happen. This means that Spectre-gadgets
>>>> and similar cannot be used to target virtual machine memory. Roughly 60%
>>>> of speculative execution issues fall into this category [1, Table 1].
>>>>
>>>> This patch series extends guest_memfd with the ability to remove its
>>>> memory from the host kernel's direct map, to be able to attain the above
>>>> protection for KVM guests running inside guest_memfd.
>>>>
>>>> === Changes to v2 ===
>>>>
>>>> - Handle direct map removal for physically contiguous pages in arch code
>>>>     (Mike R.)
>>>> - Track the direct map state in guest_memfd itself instead of at the
>>>>     folio level, to prepare for huge pages support (Sean C.)
>>>> - Allow configuring direct map state of not-yet faulted in memory
>>>>     (Vishal A.)
>>>> - Pay attention to alignment in ftrace structs (Steven R.)
>>>>
>>>> Most significantly, I've reduced the patch series to focus only on
>>>> direct map removal for guest_memfd for now, leaving the whole "how to do
>>>> non-CoCo VMs in guest_memfd" for later. If this separation is
>>>> acceptable, then I think I can drop the RFC tag in the next revision
>>>> (I've mainly kept it here because I'm not entirely sure what to do with
>>>> patches 3 and 4).
>>>
>>> Hi,
>>>
>>> keeping upcoming "shared and private memory in guest_memfd" in mind, I
>>> assume the focus would be to only remove the direct map for private memory?
>>>
>>> So in the current upstream state, you would only be removing the direct
>>> map for private memory, currently translating to "encrypted"/"protected"
>>> memory that is inaccessible either way already.
>>>
>>> Correct?
>>
>> Yea, with the upcoming "shared and private" stuff, I would expect that
>> the shared<->private conversions would call the routines from patch 3 to
>> restore direct map entries on private->shared, and zap them on
>> shared->private.
> 
> I wanted to follow-up to the discussion we had in the bi-weekly call.

Thanks for summarizing!

> We talked about shared (faultable) vs. private (unfaultable), and how it
> would interact with the directmap patches here.
> 
> As discussed, having private (unfaultable) memory with the direct-map
> removed and shared (faultable) memory with the direct-mapping can make
> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
> the discussion here seems to indicate that it might currently not be
> required.
>
> So one thing we could do is that shared (faultable) will have a direct
> mapping and be gup-able and private (unfaultable) memory will not have a
> direct mapping and is, by design, not gup-able.
> 
> Maybe it could make sense to not have a direct map for all guest_memfd
> memory, making it behave like secretmem (and it would be easy to
> implement)? But I'm not sure if that is really desirable in VM context.

This would work for us (in this scenario, the swiotlb areas would be
"traditional" memory, e.g. set to shared via mem attributes instead of
"shared" inside KVM), it's kinda what I had prototyped in my v1 of this
series (well, we'd need to figure out how to get the mappings of gmem
back into KVM, since in this setup, short-circuiting it into
userspace_addr wouldn't work, unless we banish swiotlb into a different
memslot altogether somehow). But I don't think it'd work for pKVM, iirc
they need GUP on gmem, and also want direct map removal (... but maybe,
the gmem VMA for the non-CoCo usecase and the gmem VMA for pKVM could
behave differently? non-CoCo gets essentially memfd_secret, pKVM gets
GUP+no faults of private mem).

> Having a mixture of "has directmap" and "has no directmap" for shared
> (faultable) memory should not be done. Similarly, private memory really
> should stay "unfaultable".

You've convinced me that having both GUP-able and non GUP-able
memory in the same VMA will be tricky. However, I'm less convinced about
why private memory should stay unfaultable; only that it shouldn't be
faultable into a VMA that also allows GUP. Can we have two VMAs? One
that disallows GUP, but allows userspace access to shared and private,
and one that allows GUP, but disallows accessing private memory? Maybe
via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
different spin of the above idea.
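
To make the two-VMA idea a bit more concrete, here is a rough userspace
model of the access rules I have in mind. It is only a sketch: every name
in it (`gmem_vma_kind`, `gmem_vma_allows_fault`, ...) is made up for
illustration, `PROT_NOGUP` does not exist today, and none of this is code
from the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model only: none of these names exist in the kernel. */
enum gmem_page_state { GMEM_SHARED, GMEM_PRIVATE };

enum gmem_vma_kind {
    GMEM_VMA_GUP,   /* ordinary mapping: GUP is allowed */
    GMEM_VMA_NOGUP, /* mapping created with the hypothetical PROT_NOGUP */
};

/* May get_user_pages() pin pages through this kind of VMA? */
bool gmem_vma_allows_gup(enum gmem_vma_kind vma)
{
    return vma == GMEM_VMA_GUP;
}

/* May a userspace fault through this kind of VMA map the given page? */
bool gmem_vma_allows_fault(enum gmem_vma_kind vma, enum gmem_page_state page)
{
    if (vma == GMEM_VMA_NOGUP)
        return true;            /* shared and private are both accessible */
    return page == GMEM_SHARED; /* the GUP-able VMA never exposes private memory */
}

/* The property the split is meant to give: private memory is never GUP-reachable. */
void gmem_policy_check(void)
{
    for (int v = GMEM_VMA_GUP; v <= GMEM_VMA_NOGUP; v++) {
        enum gmem_vma_kind vma = (enum gmem_vma_kind)v;

        assert(!(gmem_vma_allows_gup(vma) &&
                 gmem_vma_allows_fault(vma, GMEM_PRIVATE)));
    }
}
```

The point of the model is only that no single VMA is both GUP-able and
able to fault in private memory, which is the combination you argued
against.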

> I think one of the points raised during the bi-weekly call was that
> using a viommu/swiotlb might be the right call, such that all memory can
> be considered private (unfaultable) that is not explicitly
> shared/expected to be modified by the hypervisor (-> faultable, ->
> GUP-able).
> 
> Further, I think Sean had some good points why we should explore that
> direction, but I recall that there were some issues to be sorted out
> (interpreted instructions requiring direct map when accessing "private"
> memory?), not sure if that is already working/can be made working in KVM.

Yeah, the big one is MMIO instruction emulation on x86, which does guest
page table walks and instruction fetch (and particularly the latter
cannot be known ahead-of-time by the guest, aka cannot be explicitly
"shared"). That's what the majority of my v2 series was about. For
traditional memslots, KVM handles these via get_user and friends, but if
we don't have a VMA that allows faulting all of gmem, then that's
impossible, and we're in "temporarily restore direct map" land, which
comes with significant performance penalties due to TLB flushes.

> What's your opinion after the call and the next step for use cases like
> you have in mind (IIRC firecracker, which wants to not have the
> direct-map for guest memory where it can be avoided)?

Yea, the usecase is for Firecracker to not have direct map entries for
guest memory, unless needed for I/O (-> swiotlb).

As for next steps, let's determine once and for all if we can do the
KVM-internal guest memory accesses for MMIO emulation through userspace
mappings (although if we can't I'll have some serious soul-searching to
do, because all other solutions we talked about so far also have fairly
big drawbacks; on-demand direct map reinsertion has terrible
performance, protection keys would limit us to 15 VMs on the host, and
the page table swapping runs into problems with NMIs if I understood
Sean correctly last Thursday :( ).
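
For reference, the "15 VMs" figure for protection keys can be sketched as
below. This assumes one key per VM and that key 0 stays reserved as the
default key for all other memory; x86 page table entries carry a 4-bit
key, so there are 16 keys in total. The pool type and function names are
hypothetical, not an existing kernel API.

```c
#include <assert.h>

#define PKEY_BITS    4                /* x86 PTEs carry a 4-bit protection key */
#define NR_PKEYS     (1 << PKEY_BITS) /* 16 keys in total */
#define PKEY_DEFAULT 0                /* assumed reserved for all non-guest memory */

/* Hypothetical host-wide pool handing out one protection key per VM. */
struct pkey_pool {
    unsigned int used; /* bit k set means key k is taken */
};

void pkey_pool_init(struct pkey_pool *pool)
{
    pool->used = 1u << PKEY_DEFAULT; /* the default key is never handed out */
}

/* Returns the lowest free key, or -1 once every non-default key is in use. */
int pkey_pool_alloc(struct pkey_pool *pool)
{
    for (int key = 0; key < NR_PKEYS; key++) {
        if (!(pool->used & (1u << key))) {
            pool->used |= 1u << key;
            return key;
        }
    }
    return -1;
}

void pkey_pool_free(struct pkey_pool *pool, int key)
{
    assert(key > PKEY_DEFAULT && key < NR_PKEYS); /* never release the default key */
    pool->used &= ~(1u << key);
}
```

With this model, the 16th VM on a host fails to get a key until another
VM releases one.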

> -- 
> Cheers,
> 
> David / dhildenb

Best, 
Patrick