UML for arm64

Sat Jun 24 06:15:34 PDT 2023

On Fri, 2023-06-23 at 16:34 -0600, Rob Herring wrote:
> 
> > 
> > Either way, the old patchset will give you a good idea about how it all
> > works, the changes are mostly in the details. I am happy to push out a
> > new version sooner rather than later if it might help with any efforts
> > on your side.
> 
> From a quick scan, it looks like there's some cleanups in the series
> which would be helpful without seccomp parts. One of the initial
> issues I've found is UML using older ptrace interfaces which arm64
> doesn't implement. PTRACE_GETREGS for example.
> 

I don't think that completely gets rid of PTRACE_GETREGS though, and if
I remember correctly, we really kind of need that there?

Though then again it's all been a while, and I only faulted the seccomp
mode back in during discussions with Benjamin. Looks like we've found a
potentially nicer way to make it secure than his secret-based approach,
and in fact one that should even make it SMP-safe, at least in theory.
Obviously, a lot of infrastructure is still missing to make it SMP in
the first place.

Currently, UML has a host process per VMA. Obviously, for SMP you need
multiple host processes, i.e. one per (used) CPU per VMA, created with
CLONE_VM.

The problem with the secrets-based approach here for SMP is that the
secret will be readable to the other threads running in the VMA (1) and
can then be used to circumvent the protection by jumping into the stub
area and calling host syscalls, see
https://patchwork.ozlabs.org/project/linux-um/patch/20221122100759.208290-28-benjamin@sipsolutions.net/
and
https://patchwork.ozlabs.org/project/linux-um/patch/20221122100759.208290-24-benjamin@sipsolutions.net/

Now the new idea we came up with is this: We can create the per-CPU VMA
processes with CLONE_VM but *not* CLONE_FILES. Then, in the stub, when we
need to execute some real host syscalls on behalf of the child, we

 * send the FD over in a message
 * use the FD for mmap (and also always use mmap instead of mprotect)
 * close the FD
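
A minimal user-space sketch of these three steps, assuming the FD arrives
as SCM_RIGHTS ancillary data on a UNIX socket. The helper names (send_fd,
recv_fd, map_passed_fd, fd_passing_demo) and the temp-file path are
hypothetical, and the real stub would issue raw syscalls rather than go
through libc:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one passed FD. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Sender side: pass `fd` over a UNIX socket as SCM_RIGHTS data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Step 1: receive the FD in a message. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Steps 1-3: receive the FD, use it for mmap, close it again. */
static void *map_passed_fd(int sock, size_t len)
{
    int fd = recv_fd(sock);
    void *p;

    if (fd < 0)
        return MAP_FAILED;
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after this */
    return p;
}

/* Demo in a single process: returns 0 if the data is readable
 * through the mapping after the passed FD has been closed. */
int fd_passing_demo(void)
{
    char path[] = "/tmp/fd-pass-XXXXXX";
    int sv[2], fd, ok;
    char *p;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "uml", 3) != 3 || send_fd(sv[0], fd) < 0)
        return -1;
    close(fd);
    p = map_passed_fd(sv[1], 3);
    close(sv[0]);
    close(sv[1]);
    if (p == MAP_FAILED)
        return -1;
    ok = memcmp(p, "uml", 3) == 0;
    munmap(p, 3);
    return ok ? 0 : -1;
}
```

Calling fd_passing_demo() exercises the whole sequence over a socketpair;
in the scheme described here the sender would be the UML kernel side and
the receiver the stub.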

Without CLONE_FILES, another thread cannot "steal" the (real, host) FD;
it's useless in the other thread. Note that I'm talking about host FDs
here. In-UML FDs are just numbers and work completely differently; I'm
only talking about executing host syscalls inside the VMA, to set up the
VMA correctly etc.
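
To illustrate the FD-table point, here is a small sketch, assuming Linux
and the glibc clone() wrapper. A child created with CLONE_VM but without
CLONE_FILES opens a descriptor at a number chosen by the parent; the
parent sees the child's memory write but not the descriptor. The names
(struct shared, child_fn, child_fd_is_private) are hypothetical and this
is not UML code:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared {
    int probe_fd;            /* FD number the child should occupy */
    volatile int child_done; /* set by the child, seen via CLONE_VM */
};

static int child_fn(void *arg)
{
    struct shared *s = arg;
    int fd = open("/dev/null", O_RDONLY);

    /* Occupy probe_fd in the child's own (copied) FD table. */
    if (fd < 0 || dup2(fd, s->probe_fd) < 0)
        return 1;
    s->child_done = 1;
    return 0;
}

/* Returns 1 if the child's FD is invisible to the parent although the
 * memory write is visible, 0 if the FD leaked, -1 on setup errors. */
int child_fd_is_private(void)
{
    enum { STACK_SIZE = 64 * 1024 };
    struct shared s = { 0, 0 };
    char *stack;
    int tmp, status;
    pid_t pid;

    /* Find an FD number that is definitely free in the parent. */
    tmp = open("/dev/null", O_RDONLY);
    if (tmp < 0)
        return -1;
    s.probe_fd = fcntl(tmp, F_DUPFD, 100);
    close(tmp);
    if (s.probe_fd < 0)
        return -1;
    close(s.probe_fd);

    stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;
    /* Shared address space, separate FD table (no CLONE_FILES).
     * The stack grows down on the usual architectures, so pass its top. */
    pid = clone(child_fn, stack + STACK_SIZE, CLONE_VM | SIGCHLD, &s);
    if (pid < 0 || waitpid(pid, &status, 0) != pid) {
        free(stack);
        return -1;
    }
    free(stack);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0 || !s.child_done)
        return -1;

    /* probe_fd was only ever opened in the child's table. */
    return fcntl(s.probe_fd, F_GETFD) == -1;
}
```

Adding CLONE_FILES to the flags would make the two tasks share one FD
table, which is exactly the case the scheme avoids.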

The BPF program now allows any of the relevant syscalls (2) inside the stub
area, but since you don't have the FD unless you actually executed the
recvmsg() call at the beginning of the stub you can't do anything with
that by jumping into the stub.
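
As a rough illustration of restricting syscalls to the stub area, the
sketch below builds a classic-BPF seccomp filter that only checks the
instruction pointer. It is not the filter from the patchset: the function
name build_stub_filter is hypothetical, the syscall-number allowlist and
the architecture check are left out, SECCOMP_RET_TRAP for everything else
is a placeholder policy, and the stub range is assumed not to cross a
4 GiB boundary:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Classic BPF loads 32-bit words, so the 64-bit IP is checked in halves. */
#define IP_OFF offsetof(struct seccomp_data, instruction_pointer)
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define IP_HI_OFF (IP_OFF)
#define IP_LO_OFF (IP_OFF + 4)
#else
#define IP_LO_OFF (IP_OFF)
#define IP_HI_OFF (IP_OFF + 4)
#endif

#define STUB_FILTER_LEN 7

/* Fill `out` (STUB_FILTER_LEN entries) with a filter that allows
 * syscalls issued from [start, end) and traps all others.
 * Returns the instruction count, or 0 for an unsupported range. */
size_t build_stub_filter(struct sock_filter *out, uint64_t start,
                         uint64_t end)
{
    uint32_t hi = (uint32_t)(start >> 32);
    uint32_t lo_start = (uint32_t)start;
    uint32_t lo_end = (uint32_t)end;

    /* Assumption: both ends share the same upper 32 bits. */
    if (start >= end || (uint32_t)(end >> 32) != hi)
        return 0;

    const struct sock_filter prog[STUB_FILTER_LEN] = {
        /* 0: A = upper half of the IP */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_HI_OFF),
        /* 1: upper half differs -> trap (insn 6) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, hi, 0, 4),
        /* 2: A = lower half of the IP */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_LO_OFF),
        /* 3: below the stub start -> trap (insn 6) */
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, lo_start, 0, 2),
        /* 4: at or past the stub end -> trap, else allow */
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, lo_end, 1, 0),
        /* 5 */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* 6: placeholder action for everything outside the stub */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    };

    memcpy(out, prog, sizeof(prog));
    return STUB_FILTER_LEN;
}
```

Such a program would be installed with prctl(PR_SET_NO_NEW_PRIVS, 1, ...)
followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog), where
fprog is a struct sock_fprog pointing at the array.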

There are some other details involved such as having to split the stub
data into a read-only "what to execute" (3) and a writeable "results"
page, but those are reasonably easy to deal with.

At the end of the stub, of course the FD must be closed. This is fine
though since mappings persist after the FD is closed. On the next page
fault we have to do it all again, but yeah, page faults were always
super expensive in UML ...

johannes

(1) unless you extract it directly out of the BPF program into registers
via some magic syscalls that get a BPF return, but that's kind of icky
too

(2) mmap & munmap are the most relevant ones; there are some others, but
they're not that critical for security. Notably, mprotect is not allowed;
protection changes must be done with mmap instead
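
A sketch of how mmap can stand in for mprotect, assuming the region is a
shared mapping of a file whose FD is at hand: mapping the same file range
again at the same address with MAP_FIXED replaces the old mapping with
the new protection. The helper names and the temp-file path are
hypothetical:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Change the protection of [addr, addr + len) by mapping the same file
 * range again at the same address. Needs the FD, unlike mprotect(). */
static int reprotect_with_mmap(void *addr, size_t len, int prot,
                               int fd, off_t off)
{
    void *p = mmap(addr, len, prot, MAP_SHARED | MAP_FIXED, fd, off);

    return p == addr ? 0 : -1;
}

/* Returns 0 if a page goes from read-write to read-only via mmap while
 * keeping its contents, -1 otherwise. */
int mmap_reprotect_demo(void)
{
    char path[] = "/tmp/reprot-XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int fd, zero = -1, ret = -1;
    char *p;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    zero = open("/dev/zero", O_RDONLY);
    if (zero < 0 || ftruncate(fd, page) < 0)
        goto out;
    p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    strcpy(p, "uml");

    /* The data lives in the file, so it survives the remap. The kernel
     * then refuses to write into the page: read() fails with EFAULT. */
    if (reprotect_with_mmap(p, page, PROT_READ, fd, 0) == 0 &&
        strcmp(p, "uml") == 0 &&
        read(zero, p, 1) == -1 && errno == EFAULT)
        ret = 0;
    munmap(p, page);
out:
    if (zero >= 0)
        close(zero);
    close(fd);
    return ret;
}
```

Because the remap needs the FD, a thread that jumps into the stub without
having received it cannot change protections this way.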

(3) so another thread can't actually overwrite the instructions for what
the kernel wants to run inside the VMA process while it's happening