UML for arm64

Sat Jun 24 13:05:50 PDT 2023

On Sat, 2023-06-24 at 15:15 +0200, Johannes Berg wrote:
> On Fri, 2023-06-23 at 16:34 -0600, Rob Herring wrote:
> > 
> > > 
> > > Either way, the old patchset will give you a good idea about how it all
> > > works, the changes are mostly in the details. I am happy to push out a
> > > new version sooner rather than later if it might help with any efforts
> > > on your side.
> > 
> > From a quick scan, it looks like there are some cleanups in the series
> > which would be helpful without the seccomp parts. One of the initial
> > issues I've found is UML using older ptrace interfaces which arm64
> > doesn't implement. PTRACE_GETREGS for example.
> > 
> 
> I don't think that completely gets rid of PTRACE_GETREGS though, and if
> I remember correctly, we really kind of need that there?

The SECCOMP code should not need any ptrace at all. All it does is
read/write the mcontext that is generated by the host. I think there
was just some mangling there to map the basic registers into the format
that UML expects internally (floating point, SSE, etc. are just copied
directly though).

> Though then again it's all been a while, and I only faulted the seccomp
> mode back in in discussions with Benjamin. Looks like we've found a
> potentially nicer way to make it secure than his secret-based approach,
> and in fact in a way that should even make it SMP-safe, at least in
> theory, obviously a lot of infrastructure is missing to make it SMP in
> the first place.

Yeah, the idea with FD passing and CLONE_VM without CLONE_FILES does
indeed seem very promising both for a secure SECCOMP model and SMP
support specifically. It should be much easier to implement than my
previous secret-based syscall authentication idea.

Benjamin

> 
> Currently, UML has a host process per VMA. Obviously, you need multiple
> host processes for SMP, i.e. one per (used) CPU per VMA,
> with CLONE_VM.
> 
> The problem with the secrets-based approach here for SMP is that the
> secret will be readable to the other threads running in the VMA (1) and then can
> be used for circumventing the protection by jumping into the stub area
> and calling host syscalls, see
> https://patchwork.ozlabs.org/project/linux-um/patch/20221122100759.208290-28-benjamin@sipsolutions.net/
> and
> https://patchwork.ozlabs.org/project/linux-um/patch/20221122100759.208290-24-benjamin@sipsolutions.net/
> 
> 
> Now the new idea we came up with is this: We can make the per-CPU VMA
> with CLONE_VM but *not* CLONE_FILES. Then, in the stub, when we need to
> execute some real host syscalls on behalf of the child, we
> 
>  * send the FD over in a message
>  * use the FD for mmap (and also always use mmap instead of mprotect)
>  * close the FD
> 
> Without CLONE_FILES, another thread cannot "steal" the (real, host) FD,
> it's useless in the other thread. Note that I'm talking about host FDs
> here, in-UML FDs are just numbers and it all works differently, I'm just
> talking about executing host syscalls inside the VMA, to set up the VMA
> correctly etc.
> 
> The BPF program now allows any of the relevant syscalls (2) inside the stub
> area, but since you don't have the FD unless you actually executed the
> recvmsg() call at the beginning of the stub you can't do anything with
> that by jumping into the stub.
> 
> There are some other details involved such as having to split the stub
> data into a read-only "what to execute" (3) and a writeable "results"
> page, but those are reasonably easy to deal with.
> 
> At the end of the stub, of course the FD must be closed. This is fine
> though since mappings persist after the FD was closed. On the next page
> fault we have to do it all again, but yeah, page faults were always
> super expensive in UML ...
> 
> johannes
> 
> 
> (1) unless you extract it directly out of the BPF program into registers
> via some magic syscalls that get a BPF return, but that's kind of icky
> too
> 
> (2) mmap & munmap are the most relevant ones, some others but they're
> not that critical for security; notably mprotect is not allowed but must
> be done with mmap instead
> 
> (3) so another thread can't actually overwrite the instructions that
> the kernel wants to run inside the VMA process while they are being
> executed
>