[RFC PATCH 00/13] nommu UML

Mon Oct 28 06:32:43 PDT 2024

Hello Hajime,

On Sun, 2024-10-27 at 18:10 +0900, Hajime Tazaki wrote:
> thank you for your time looking at this.
> 
> On Sat, 26 Oct 2024 19:19:08 +0900,
> Benjamin Berg wrote:
> 
> > > - a crash in a userspace program crashes the UML kernel instead of
> > >   delivering SIGSEGV to the program.
> > > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> > >   a vma structure for our case, which updates the internal procedure
> > >   of maple_tree subsystem.  We're trying to fix the issue, but a
> > >   random process still crashes on exit(2).
> > 
> > Btw. are you handling FP register save/restore? If it is not there, it
> > probably would not be too hard to add (XSAVE, etc.), though it might
> > add a bit of additional overhead. Especially as UML always saves the FP
> > state rather than optimizing it like the x86 architectures.
> 
> The patchset handles FP registers on syscall entry/exit; patch [07/13]
> contains this part.

That patch looks like it handles the FS/GS registers, which are used
for thread-local storage. I was talking about the floating point
registers. Maybe you meant another patch?

> I'm not familiar with that, but what kind of optimizations does the
> x86 architecture do for FP register handling?

The kernel does not usually need the FP registers. So it optimizes the
pretty common case of a userspace -> kernel -> userspace switch that
happens for a syscall by simply not saving/restoring these registers at
all.

Obviously, it then still needs to do the work when the task is switched
or in the rare case that the kernel wants to use floating point itself.
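
To make the idea concrete, here is a toy model in plain C. It is not
kernel code; every name in it (struct task, cpu_fp, switch_to, ...) is
made up for illustration. The syscall path leaves the FP state alone,
and only the task switch pays for a save and a restore:

```c
/* Toy model only: all names are hypothetical, this is not kernel code. */
struct fp_state {
    double regs[4];
};

struct task {
    struct fp_state saved_fp;   /* in-memory copy of the task's FP state */
};

static struct fp_state cpu_fp;  /* stands in for the live FP registers */
static unsigned long fp_saves, fp_restores;

/* Syscall path: the kernel does not use FP, so there is nothing to do. */
static void syscall_entry(struct task *t) { (void)t; }
static void syscall_exit(struct task *t) { (void)t; }

/* Task switch: the next task will use the registers, so do the work now. */
static void switch_to(struct task *prev, struct task *next)
{
    prev->saved_fp = cpu_fp;
    fp_saves++;
    cpu_fp = next->saved_fp;
    fp_restores++;
}
```

In this model, any number of syscall_entry()/syscall_exit() pairs leave
fp_saves at zero; only switch_to() touches the FP state.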

> > I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> > requirement on patching userspace code also seems like a lot.
> > 
> > To me, it seems much more natural to catch the userspace syscalls using
> > a SECCOMP filter[1]. While quite a lot slower, that should be much more
> > portable across architectures. For improved speed one could still do
> > architecture specific things inside the vDSO or by using zpoline. But
> > those would then "just" be optimizations and unpatched code would still
> > work correctly (e.g. JIT).
> 
> I'm not proposing this patchset as a replacement for the existing UML
> implementation; for instance, it cannot run CONFIG_MMU code anywhere
> in the kernel tree, so the existing ptrace-based implementation still
> has real use cases.  The ptrace-based syscall hook is indeed not fast,
> and replacing it with a seccomp filter clearly has benefits.  I think
> that is independent of this patchset.

Of course. nommu mode is a completely independent feature.

I am still wondering a bit about who the users of such a mode would be.
It is not interesting for us, as we use UML for testing. Of course,
speed is nice, but it is not the primary objective.

I understand that it can be an approach for a small "container", but
then you would need a very strict SECCOMP filter for the kernel itself.

> So I think while your seccomp patches are also in review, this
> patchset can exist in parallel.
> 
> btw, though I mentioned that JIT generated code is not currently
> handled in a different reply, it can be implemented as an extension to
> this patchset; the original implementation of zpoline now is able to
> patch JIT generated code as well.
> 
> https://github.com/yasukata/zpoline/pull/20/commits/c42af16757ad3fcdf7084c9f2139bb9105796873
> 
> it is not implemented for the moment.
> 
> in terms of portability, the basic idea of a syscall hook with
> zpoline is applicable to other platforms, like aarch64
> (https://github.com/retrage/svc-hook).  so I believe there is a chance
> to extend this idea to architectures other than x86_64.

Right, aarch64 is probably the most interesting one in general. At
least there was some interest in a UML port.

> > For me, a big argument in favour of such an approach is its simplicity.
> > I am mostly basing that on the fact that this patchset should properly
> > handle other signals like SIGFPE and SIGSEGV. And, once it does that,
> > you will already have all the infrastructure to do the correct register
> > save/restore using the host mcontext, which is what is needed in the
> > SIGSYS handler when using SECCOMP. The filter itself should be simple
> > as it just needs to catch all syscalls within valid userspace
> > executable memory[2] ranges.
> 
> I agree with your observation that the approach is simple.
> I don't have a good idea on how to handle SIGSEGV, but will try to see
> with your inputs.

You can probably use "[RFC PATCH v2 5/9] um: Add helper functions to
get/set state for SECCOMP" for getting the registers and also writing
them back if you want to restore using rt_sigreturn.

> > [1] Maybe not surprising, as I have been working on a SECCOMP based UML
> > that does not require ptrace.
> 
> yes, I'm aware of it since before.  I have also conducted a benchmark
> with several hook mechanisms, including seccomp with simple getpid
> measurement.
> 
> https://speakerdeck.com/thehajime/netdev0x18-zpoline?slide=16

Sure! I saw that :-)

> > [2] I am assuming that userspace executable code is already confined to
> > a certain address space within the UML process. Obviously, the kernel
> > itself and loaded modules need to be free to do host syscalls and
> > should not be affected by the SECCOMP filter.
> 
> I think our !MMU UML doesn't break this assumption.  But did you see
> something in our patchset that breaks it?

I also assume that is fine. One just needs to understand this when
writing a SECCOMP filter for syscall emulation in nommu mode.
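
As a rough sketch of what I mean (untested against the patchset, and
build_ip_range_filter() is a made-up helper name): a classic BPF
program that returns SECCOMP_RET_TRAP, i.e. raises SIGSYS, only for
syscalls whose instruction pointer lies in [first, last] and allows
everything else. It assumes a little-endian host and a range that does
not cross a 4 GiB boundary, because BPF loads are 32 bits wide. A real
filter should also check seccomp_data.arch first.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offset of the low 32 bits of the instruction pointer (little-endian). */
#define IP_LO ((uint32_t)offsetof(struct seccomp_data, instruction_pointer))
#define IP_HI (IP_LO + 4)

/*
 * Hypothetical helper: fill f[] with a 7-instruction filter and return
 * the instruction count, or 0 if the range is unsupported by this sketch.
 */
static size_t build_ip_range_filter(struct sock_filter f[7],
                                    uint64_t first, uint64_t last)
{
    size_t n = 0;

    if (first > last || (first >> 32) != (last >> 32))
        return 0;

    /* Upper half of the IP must match, otherwise jump to ALLOW. */
    f[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_HI);
    f[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                          (uint32_t)(first >> 32), 0, 4);
    /* Lower half must satisfy first <= ip <= last. */
    f[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_LO);
    f[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K,
                                          (uint32_t)first, 0, 2);
    f[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K,
                                          (uint32_t)last, 1, 0);
    /* Inside the userspace range: deliver SIGSYS to the emulation code. */
    f[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP);
    /* Kernel and module code: let the host syscall through. */
    f[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);

    return n;
}
```

The resulting array would go into a struct sock_fprog and be installed
with prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) followed by the seccomp(2)
SECCOMP_SET_MODE_FILTER operation.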

Benjamin