[PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

Mon Feb 8 16:26:05 EST 2021

On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt at linux.ibm.com>
> > 
> > Introduce "memfd_secret" system call with the ability to create memory
> > areas visible only in the context of the owning process and not mapped not
> > only to other processes but in the kernel page tables as well.
> > 
> > The secretmem feature is off by default and the user must explicitly enable
> > it at the boot time.
> > 
> > Once secretmem is enabled, the user will be able to create a file
> > descriptor using the memfd_secret() system call. The memory areas created
> > by mmap() calls from this file descriptor will be unmapped from the kernel
> > direct map and they will be only mapped in the page table of the owning mm.
> 
> Is this really true? I guess you meant to say that the memory will
> visible only via page tables to anybody who can mmap the respective file
> descriptor. There is nothing like an owning mm as the fd is inherently a
> shareable resource and the ownership becomes a very vague and hard to
> define term.

Hmm, it seems I've been dragging this paragraph from the very first
mmap(MAP_EXCLUSIVE) rfc and nobody (including myself) noticed the
inconsistency.

> > The file descriptor based memory has several advantages over the
> > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > paves the way for VMMs to remove the secret memory range from the process;
> 
> I do not understand how it helps to remove the memory from the process
> as the interface explicitly allows to add a memory that is removed from
> all other processes via direct map.

The current implementation does not help to remove the memory from the
process, but using fd-backed memory seems a better interface to remove
guest memory from host mappings than mmap. As Andy nicely put it:

"Getting fd-backed memory into a guest will take some possibly major work in
the kernel, but getting vma-backed memory into a guest without mapping it
in the host user address space seems much, much worse."

> > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > usage of a dedicated system call rather than hooking new functionality into
> > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > allows better upwards compatibility.
> 
> What is this supposed to mean? What are differences?

Well, the phrasing could be better indeed. That supposed to mean that
they differ in the semantics behind the file descriptor: memfd_create
implements sealing for shmem and hugetlbfs while memfd_secret implements
memory hidden from the kernel.

> > The secretmem mappings are locked in memory so they cannot exceed
> > RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
> > mlock() secretmem range would fail and mlockall() will ignore secretmem
> > mappings.
> 
> What about munlock?

Isn't this implied? ;-)
I'll add a sentence about it.

-- 
Sincerely yours,
Mike.