[PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

Michal Hocko mhocko at suse.com
Mon Feb 15 14:20:45 EST 2021

On Mon 15-02-21 10:14:43, James Bottomley wrote:
> On Mon, 2021-02-15 at 10:13 +0100, Michal Hocko wrote:
> > On Sun 14-02-21 11:21:02, James Bottomley wrote:
> > > On Sun, 2021-02-14 at 10:58 +0100, David Hildenbrand wrote:
> > > [...]
> > > > > And here we come to the question "what are the differences that
> > > > > justify a new system call?" and the answer to this is very
> > > > > subjective. And as such we can continue bikeshedding forever.
> > > > 
> > > > I think this fits into the existing memfd_create() syscall just
> > > > fine, and I heard no compelling argument why it shouldn't. That's
> > > > all I can say.
> > > 
> > > OK, so let's review history.  In the first two incarnations of the
> > > patch, it was an extension of memfd_create().  The specific
> > > objection by Kirill Shutemov was that it doesn't share any code in
> > > common with memfd and so should be a separate system call:
> > > 
> > > https://lore.kernel.org/linux-api/20200713105812.dnwtdhsuyj3xbh4f@box/
> > 
> > Thanks for the pointer. But this argument hasn't been challenged at
> > all. It hasn't been brought up that the overlap would be considerably
> > higher with hugetlb/sealing support. And so far nobody has claimed
> > those combinations as unviable.
> Kirill is actually interested in the sealing path for his KVM code so
> we took a look.  There might be a two line overlap in memfd_create for
> the seal case, but there's no real overlap in memfd_add_seals which is
> the bulk of the code.  So the best way would seem to lift the inode ...
> -> seals helpers to be non-static so they can be reused and roll our
> own add_seals.

These are implementation details which are not really relevant to the

> I can't see a use case at all for hugetlb support, so it seems to be a
> bit of an angels on pin head discussion.  However, if one were to come
> along handling it in the same way seems reasonable.

Those angels have made their way to mmap, System V shm, memfd_create and
other MM interfaces which never envisioned them when introduced. Hugetlb
pages backing guest memory are quite a common use case, so why do you
think those guests wouldn't like to see their memory be "secret"?

As I've said in my last response (YCZEGuLK94szKZDf at dhcp22.suse.cz), I am
not going to argue all these again. I have made my point and you are
free to take it or leave it.

> > > The other objection raised offlist is that if we do use
> > > memfd_create, then we have to add all the secret memory flags as an
> > > additional ioctl, whereas they can be specified on open if we do a
> > > separate system call.  The container people violently objected to
> > > the ioctl because it can't be properly analysed by seccomp and much
> > > preferred the syscall version.
> > > 
> > > Since we're dumping the uncached variant, the ioctl problem
> > > disappears but so does the possibility of ever adding it back if we
> > > take on the container people's objection.  This argues for a
> > > separate syscall because we can add additional features and extend
> > > the API with flags without causing anti-ioctl riots.
> > 
> > I am sorry but I do not understand this argument.
> You don't understand why container guarding technology doesn't like
> ioctls?

No, I did not see where the ioctl argument came from.


> >  What kind of flags are we talking about and why would that be a
> > problem with memfd_create interface? Could you be more specific
> > please?
> You mean what were the ioctl flags in the patch series linked above? 

OK I see. How many potential modes are we talking about? A few or
potentially many?

> They were eventually dropped after v10, because of problems with
> architectural semantics, with the idea that it could be added back
> again if a compelling need arose:
> https://lore.kernel.org/linux-api/20201123095432.5860-1-rppt@kernel.org/
> In theory the extra flags could be multiplexed into the memfd_create
> flags like hugetlbfs is but with 32 flags and a lot already taken it
> gets messy for expansion.  When we run out of flags the first question
> people will ask is "why didn't you do separate system calls?".

OK, I do not necessarily see a lack of flag space as a problem. I can be
wrong here but I do not see how that would be solved by a separate
syscall when it sounds rather foreseeable that many modes supported by
memfd_create will eventually find their way to secret memory as well.
If for no other reason, secret memory is nothing really special. It is
just memory which is not mapped in the kernel via the 1:1 mapping. That's
it. And that can be applied to any memory provided to userspace.
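
For illustration of the flag-space point only, here is a minimal sketch of
how the hugetlb page size is multiplexed into the upper bits of the
memfd_create() flags word. The constant values are copied from
include/uapi/linux/memfd.h as it stood around the time of this thread; the
helper names encode_huge_size() and free_flag_bits() are hypothetical and
not part of any kernel or libc API.

```python
# Illustrative only: constants mirror include/uapi/linux/memfd.h (early 2021).
# The two helper functions are hypothetical names made up for this sketch.
MFD_CLOEXEC = 0x0001
MFD_ALLOW_SEALING = 0x0002
MFD_HUGETLB = 0x0004

MFD_HUGE_SHIFT = 26   # log2(page size) is stored starting at bit 26
MFD_HUGE_MASK = 0x3F  # the size field is six bits wide (bits 26..31)


def encode_huge_size(log2_size):
    """Return the flag bits selecting a hugetlb page of 2**log2_size bytes."""
    if not 0 <= log2_size <= MFD_HUGE_MASK:
        raise ValueError("log2 page size does not fit in six bits")
    return log2_size << MFD_HUGE_SHIFT


def free_flag_bits(used_flags):
    """Count bits of a 32-bit flags word not taken by used_flags or the size field."""
    claimed = used_flags | (MFD_HUGE_MASK << MFD_HUGE_SHIFT)
    return 32 - bin(claimed & 0xFFFFFFFF).count("1")


MFD_HUGE_2MB = encode_huge_size(21)  # 2 MiB = 2**21 bytes
MFD_HUGE_1GB = encode_huge_size(30)  # 1 GiB = 2**30 bytes

# A hugetlb-backed memfd request combines ordinary flags with the size field.
flags = MFD_CLOEXEC | MFD_HUGETLB | MFD_HUGE_2MB
print(hex(flags))
print(free_flag_bits(MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB))
```

Under these assumptions the size field takes six bits and the three plain
flags take three more, so any new mode bits would have to come out of the
remainder of the 32-bit word.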

But I am repeating myself again here so I better stop.
Michal Hocko
