Pmemfs/guestmemfs discussion recap and open questions
David Rientjes
rientjes at google.com
Wed Oct 16 21:42:04 PDT 2024
Hi all,
We had a very interesting discussion today led by James Gowans in the
Linux MM Alignment Session, thank you James! And thanks to everybody who
attended and provided great questions, suggestions, and feedback.
Guestmemfs[*] is proposed to provide an in-memory persistent filesystem
primarily aimed at Kexec Hand-Over (KHO) use cases: 1GB allocations, no
struct pages, unmapped from the kernel direct map. The memory for this
filesystem is set aside by the memblock allocator as defined by the
kernel command line (like guestmemfs=900G on a 1TB system).
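For illustration, the flow would look roughly like this (the exact
mount interface is whatever the patch series settles on; the paths
here are made up):

    # kernel command line: reserve 900G of a 1TB host at boot
    guestmemfs=900G

    # after boot, and again after every kexec, mount it and the
    # persistent files reappear
    mount -t guestmemfs none /mnt/guestmemfs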
----->o-----
Feedback from David Hildenbrand was that we may want to leverage HVO
(HugeTLB Vmemmap Optimization) to get struct page savings, and the
alignment was to define this as part of the filesystem configuration:
do you want all struct pages to be gone and the memory unmapped from
the kernel direct map, or left in the kernel direct map with tail
struct pages freed so I/O is still possible? You get to choose!
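One way that choice could surface, purely as an illustration (these
mount options are invented here, not proposed anywhere), is per-mount
configuration:

    # invented options, for illustration only
    mount -t guestmemfs -o nomap none /mnt/guestmemfs  # unmapped, no struct pages
    mount -t guestmemfs -o hvo   none /mnt/guestmemfs  # HVO, stays in direct map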
----->o-----
It was noted that the premise for guestmemfs sounded very similar to
guest_memfd: effectively a filesystem that would index non-anonymous
guest_memfds; indeed, this is not dissimilar to a persistent
guest_memfd. The new kernel (after kexec) would need to present the
fds to userspace so they can be used once again, so a filesystem
abstraction may make sense. We may also want to use uid and gid
permissions.
It's highly desirable for the kernel to share the same infrastructure
and source code, like struct page optimizations, unmapping from the
kernel direct map, and naming of guest_memfds. We'd want to avoid
duplicating this, but it's still an open question how it would all be
glued together.
David Hildenbrand brought up the idea of a persistent filesystem that
even databases could use, which may not be guest_memfd based.
Persistent filesystems do exist, but they lack the 1GB memory
allocation requirement; if we were to support databases or other
workloads that want to persist memory across kexec, this instead would
become a new optimized filesystem for generic use cases that require
persistence. Mike Rapoport noted that tying the ability to persist
memory across kexec to only guests would preclude this without major
changes.
Frank van der Linden noted the abstraction between guest_memfd and
guestmemfs doesn't mesh very well and we may want to do this at the
allocator level instead: basically a factory that gives you exactly what
you want -- memory unmapped from the kernel direct map, with HVO instead,
etc.
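To make the "factory" idea concrete, a rough sketch of what such an
allocator-level interface might look like (every name and flag below
is invented for illustration; nothing like this exists today):

    /* Hypothetical persistent-memory allocator "factory", illustration only. */
    #define PMEM_F_1G_CONTIG      (1 << 0)  /* 1GB physically contiguous chunks */
    #define PMEM_F_NO_DIRECT_MAP  (1 << 1)  /* unmap from the kernel direct map */
    #define PMEM_F_HVO            (1 << 2)  /* keep direct map, free tail struct pages */
    #define PMEM_F_PERSIST        (1 << 3)  /* preserve across kexec via KHO */

    struct pmem_range;  /* opaque handle to a carved-out range */

    /* Hand back memory with exactly the requested properties, carved
     * from the boot-time reservation on NUMA node @nid. */
    struct pmem_range *pmem_alloc(unsigned long size, int nid, unsigned int flags);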
Jason Gunthorpe noted there's a desire to add iommufd connections to
guest_memfd and that would have to be duplicated for guestmemfs. KVM has
special connections to it, ioctls, etc. So likely a whole new API
surface is coming around guest_memfd that guestmemfs will want to re-use.
In support of this, it was also noted that guest_memfd is largely used
for confidential computing and pKVM today, and confidential computing
is a requirement for cloud providers: they need to expose a
guest_memfd-style interface for such VMs as well.
Jason suggested that when you create a file on the filesystem, you tell
it exactly what you want: unmapped memory, guest_memfd semantics, or just
a plain file. James expanded on this by brainstorming an API for such
use cases, backed by this new kind of allocator, to provide exactly
what you need.
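As a strawman only (the flag and ioctl names below are made up, not
part of any posted series), the per-file choice could look something
like:

    /* Strawman: every identifier here is hypothetical. */
    int fd = open("/mnt/guestmemfs/vm1-memory", O_CREAT | O_RDWR, 0600);

    /* Tell the filesystem what should back this file: unmapped memory,
     * guest_memfd semantics, or nothing special (a plain file). */
    ioctl(fd, GMEMFS_SET_BACKING, GMEMFS_UNMAPPED | GMEMFS_GUEST_MEMFD);

    /* Size it out of the boot-time persistent reservation. */
    ftruncate(fd, 64UL << 30);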
----->o-----
James also noted some users are interested in smaller regions of memory
that aren't preallocated, like tmpfs, so there is interest in a
"persistent tmpfs," including dynamic sizing. This may be tricky
because tmpfs uses the page cache. In this case, the preallocation
would not be needed. Mike Rapoport noted the same is true of mapping
the memory into the kernel direct map, which is not required for
persistence (including if you want to do I/O).
The tricky part of this is to determine what should and should not be
solved with the same solution. Is it acceptable to have something like
guestmemfs, which is very specific to cloud providers running VMs in
most of their host memory?
Matthew Wilcox noted there are perhaps ways to support persistence in
tmpfs for this other use case, such as with swap. James noted this
could be used for things like the systemd information that people have
brought up for containerization. He indicated we should ensure KHO can
mark tmpfs pages as persistent. We'd need to follow up with Alex.
----->o-----
Pasha Tatashin asked about NUMA support with the current guestmemfs
proposal. James noted this would be an essential requirement. When
configuring guestmemfs= on the kernel command line, we could specify
the lengths required from each NUMA node. This would result in per-node
mount points.
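For example (syntax purely hypothetical, nothing concrete has been
proposed), something along the lines of:

    guestmemfs=512G@node0,384G@node1

which could then surface as per-node mounts such as
/mnt/guestmemfs-node0 and /mnt/guestmemfs-node1.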
----->o-----
Peter Xu asked if IOMMU page tables could be stored in guestmemfs
itself so they are preserved across kexec. James noted previous
solutions for this existed, but were tricky because of filesystem
ordering at boot.
This led to the conclusion that if we want persistent devices, then we
need persistent memory as well; only files from guestmemfs that are known
to be persistent can be mapped into a persistent VMA domain. In the case
of IOMMU page tables, the IOMMU driver needs to tell KHO that they must be
persisted.
----->o-----
My takeaways, based on the feedback that was provided in the discussion:
- we need an allocator abstraction for persistent memory that can return
memory with various characteristics: 1GB or not, kernel direct map or
not, HVO or not, etc.
- built on top of that, we need the ability to carve out very large
ranges of memory (cloud provider use case) with NUMA awareness on the
kernel command line
- we also need the ability to dynamically resize this, or to provide
hints at allocation time that memory must be persisted across kexec,
to support the non-cloud-provider use case
- we need a filesystem abstraction that maps memory of the type that is
requested, including guest_memfd, and then deals with all the fun of
multitenancy since it would be drawing from a finite per-NUMA-node
pool of persistent memory
- absolutely critical to this discussion is defining what core
infrastructure is required for a generally acceptable solution, and
then what builds off of that for the more special-cased uses (like
the cloud provider use case or persistent tmpfs use case)
We're looking to continue that discussion here and then come together
again in a few weeks.
Thanks!
[*] https://lore.kernel.org/kvm/20240805093245.889357-1-jgowans@amazon.com/