[PATCH 0/7] um: skas: harden the seccomp userspace stub

Wed Jun 24 00:54:49 PDT 2026

On Tue, 2026-06-23 at 15:08 -0700, Cong Wang wrote:
> On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
> > 
> > On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > > From: Cong Wang <cwang at multikernel.io>
> > > 
> > > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > > list could not be addressed by the filter alone:
> > > 
> > >   - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> > >     intra-guest disclosure and, on this base (single physmem fd, no
> > >     kernel/user split), a host escape; and
> > > 
> > >   - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> > >     evade preemption and wedge the monitor indefinitely.
> > > 
> > > This series closes both:
> > > 
> > >   1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> > >        owned by the monitor (no behavioural change yet).
> > >   3-4: validate each mmap() against the mm's page table -- allowed iff the
> > >        PTE already maps the requested frame with no more access than it
> > >        grants -- including out-of-batch mmaps a hijacked stub issues on
> > >        its own.
> > >   5:   route and validate munmap() the same way (range-confined below
> > >        STUB_START).
> > 
> > 
> > That approach seems odd to me. Adding an explicit out-of-band check
> > means you require two extra context switches per mmap syscall. I would
> > expect that this makes the SECCOMP approach a lot slower than ptrace().
> > My take is still that it is possible to carefully craft a SECCOMP
> > filter as well as stub/kernel code that makes exploitation impossible
> > for non-SMP.
> > 
> > The true SMP case is more complicated, but we do not have that anyway,
> > so I would not worry about it for now.
> > 
> > Did you run any performance tests?
> 
> Sorry, I thought I included them in patch 4/7, but I missed them.
> 
> Here they are:
> 
> mmap() -- ns/call
> -----------------
> size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
> 4K     13056      10368         +2688  (+26%)     17408      15104
> 16K    13824      11008         +2816  (+26%)     15616      12544
> 64K    19840      15616         +4224  (+27%)     21760      17408
> 256K   33152      28928         +4224  (+15%)     36352      31744
> 1M     95616      84608         +11008 (+13%)     117504     108032
> 
> munmap() -- ns/call
> -------------------
> size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
> 4K     11008      8448          +2560  (+30%)     13568      11264
> 16K    10752      8448          +2304  (+27%)     12288      9472
> 64K    12032      9728          +2304  (+24%)     13824      10752
> 256K   15360      12800         +2560  (+20%)     17152      14592
> 1M     27648      24832         +2816  (+11%)     30976      26880
> 
> Since this is a clear security-vs-performance balance, I picked security
> over performance. Please let me know if you prefer otherwise.

So, I still believe that (almost) zero overhead is possible in the
current framework. Maybe one could add an option with non-zero
overhead, but then I would like to be able to disable it, as there are
plenty of scenarios where security is of no concern.

Also, I am unsure what you measured there. When you write "mmap" here,
is that an "mmap" call made by the userspace process (which might not
actually change the page mappings), or is that an "mmap" call that was
queued for the stub to process?

A large amount of the overhead that UML tends to have happens when
handling both minor and major page faults. Simple tests for this are,
for example, a relatively tight fork/exec loop or just starting up
applications. In our test runs, we saw major speed improvements (>10%)
just by fixing bugs in the TLB handling code, which resulted in fewer
page faults happening at application runtime.
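
For illustration only, a minimal sketch of such a tight fork/exec loop,
meant to be run inside the guest. The function name fork_exec_loop_ns()
is made up for this example, and it assumes a "true" binary is reachable
via PATH; it is not taken from any existing test suite.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `iterations` fork + exec + wait cycles of the
 * "true" utility and return the mean wall-clock cost per cycle in
 * nanoseconds, or -1.0 on error.
 */
static double fork_exec_loop_ns(int iterations)
{
	struct timespec start, end;
	int i;

	assert(iterations > 0);
	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1.0;

	for (i = 0; i < iterations; i++) {
		int status;
		pid_t pid = fork();

		if (pid < 0)
			return -1.0;
		if (pid == 0) {
			/* Child: replace the image; assumes "true" is in PATH. */
			execlp("true", "true", (char *)NULL);
			_exit(127);	/* only reached if exec failed */
		}
		/* Parent: wait for the child and require a clean exit. */
		if (waitpid(pid, &status, 0) != pid)
			return -1.0;
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			return -1.0;
	}

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1.0;

	return ((double)(end.tv_sec - start.tv_sec) * 1e9 +
		(double)(end.tv_nsec - start.tv_nsec)) / iterations;
}
```

Calling e.g. fork_exec_loop_ns(1000) with and without the series applied
would exercise the page fault and mapping paths as a whole, rather than a
single mmap() call in isolation.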

> Meanwhile, after a second thought, there is a zero-overhead solution.
> pidfd_mmap(), which can be built on top of another patch of mine:
> https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/

Sure, should some sort of method land that allows directly modifying
the mappings of the remote process, that would be rather convenient. 

> Something like:
> 
> int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
>         int phys_fd, unsigned long long offset)
> {
>         return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
>                           prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
>                           phys_fd, offset);
> }
> int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
> {
>         return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
> }
> 
> Obviously, this one requires a new syscall for the host kernel first.
> 
> > 
> > >   6:   add a watchdog thread that detects a stub which stops reporting
> > >        back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> > >        recover via the existing teardown.
> > 
> > That also seems like an odd solution to me. Architecturally, UML first
> > receives the SIGALRM and forwards it to the child. It would seem much
> > easier to set a flag and clear it again when the process reports back
> > that it received the SIGALRM. Then, when the kernel receives the next
> > SIGALRM, just kill the child immediately if the flag is still set.
> 
> The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
> the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
> signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
> child" is accurate there. This patch is on the seccomp path
> (wait_stub_done_seccomp), which is architecturally different.
> 
> Or am I missing anything?

What you are missing is that it is not actually different from an
architectural standpoint. In both cases SIGALRM is reported back to the
kernel. It obviously has to be reported back, otherwise scheduling
would not work.
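
As a rough sketch of the flag-based idea (set a flag when SIGALRM is
forwarded, clear it when the stub reports back, kill the child if the
flag is still set on the next SIGALRM): all names below are hypothetical
and not taken from the UML sources, and the actual signal forwarding and
kill are only indicated in comments.

```c
#include <stdbool.h>

/* Hypothetical per-stub bookkeeping; not a structure from the UML tree. */
struct stub_tick_state {
	bool alarm_pending;	/* SIGALRM forwarded, stub has not reported back */
	bool killed;		/* the monitor decided to kill the stub */
};

/*
 * Called when the monitor itself receives SIGALRM.
 * Returns true if the stub missed the previous tick and must be killed.
 */
static bool monitor_on_sigalrm(struct stub_tick_state *s)
{
	if (s->alarm_pending) {
		/* Previous SIGALRM was never reported back. */
		s->killed = true;	/* real code would SIGKILL the stub here */
		return true;
	}
	s->alarm_pending = true;	/* real code would forward SIGALRM here */
	return false;
}

/* Called when the stub reports back that it received the SIGALRM. */
static void monitor_on_stub_report(struct stub_tick_state *s)
{
	s->alarm_pending = false;
}
```

With this, a stub that blocks SIGALRM survives for at most one extra
timer period, and no additional thread is needed.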

Benjamin

> 
> Thanks for your review.
> 
> Regards,
> Cong
>