[PATCH 0/7] um: skas: harden the seccomp userspace stub
Benjamin Berg
benjamin at sipsolutions.net
Wed Jun 24 00:54:49 PDT 2026
On Tue, 2026-06-23 at 15:08 -0700, Cong Wang wrote:
> On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
> >
> > On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > > From: Cong Wang <cwang at multikernel.io>
> > >
> > > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > > list could not be addressed by the filter alone:
> > >
> > > - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> > >   intra-guest disclosure and, on this base (single physmem fd, no
> > >   kernel/user split), a host escape; and
> > >
> > > - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> > >   evade preemption and wedge the monitor indefinitely.
> > >
> > > This series closes both:
> > >
> > > 1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> > >      owned by the monitor (no behavioural change yet).
> > > 3-4: validate each mmap() against the mm's page table -- allowed iff the
> > >      PTE already maps the requested frame with no more access than it
> > >      grants -- including out-of-batch mmaps a hijacked stub issues on
> > >      its own.
> > > 5:   route and validate munmap() the same way (range-confined below
> > >      STUB_START).
> >
> >
> > That approach seems odd to me. Adding an explicit out-of-band check
> > means you require two extra context switches per mmap syscall. I would
> > expect that this makes the SECCOMP approach a lot slower than ptrace().
> > My take is still that it is possible to carefully craft a SECCOMP
> > filter as well as stub/kernel code that makes exploitation impossible
> > for non-SMP.
> >
> > The true SMP case is more complicated, but we do not have that anyway,
> > so I would not worry about it for now.
> >
> > Did you run any performance tests?
>
> Sorry, I thought I included them in patch 4/7, but I missed them.
>
> Here they are:
>
> mmap() -- ns/call
> -----------------
> size   WITH med  WITHOUT med  delta           WITH p99  WITHOUT p99
> 4K        13056        10368  +2688 (+26%)       17408        15104
> 16K       13824        11008  +2816 (+26%)       15616        12544
> 64K       19840        15616  +4224 (+27%)       21760        17408
> 256K      33152        28928  +4224 (+15%)       36352        31744
> 1M        95616        84608  +11008 (+13%)     117504       108032
>
> munmap() -- ns/call
> -------------------
> size   WITH med  WITHOUT med  delta           WITH p99  WITHOUT p99
> 4K        11008         8448  +2560 (+30%)       13568        11264
> 16K       10752         8448  +2304 (+27%)       12288         9472
> 64K       12032         9728  +2304 (+24%)       13824        10752
> 256K      15360        12800  +2560 (+20%)       17152        14592
> 1M        27648        24832  +2816 (+11%)       30976        26880
>
> Since this is a clear security-vs-performance balance, I picked security
> over performance. Please let me know if you prefer otherwise.

So, I still believe that (almost) zero-overhead is possible in the
current framework. Maybe one could add a non-zero overhead option, but
then I would like to be able to disable it as there are plenty of
scenarios where security is of no concern.

Also, I am unsure what you measured there. When you write "mmap" here,
is that an "mmap" call made by the userspace process (which might not
actually change the page mappings), or is that an "mmap" call that was
queued for the stub to process?

A large amount of the overhead that UML tends to have happens when
handling both minor and major page faults. Simple tests for this are,
for example, a relatively tight fork/exec loop or just applications
starting up. In our test runs, we saw major speed improvements (>10%)
just by fixing bugs in the TLB handling code, which resulted in fewer
page faults happening at application runtime.

> Meanwhile, after a second thought, there is a zero-overhead solution.
> pidfd_mmap(), which can be built on top of another patch of mine:
> https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/

Sure, should some sort of method land that allows directly modifying
the mappings of the remote process, that would be rather convenient.

> Something like:
>
> int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
>         int phys_fd, unsigned long long offset)
> {
>         return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
>                           prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
>                           phys_fd, offset);
> }
>
> int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
> {
>         return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
> }
>
> Obviously, this one requires a new syscall for the host kernel first.
>
> >
> > > 6: add a watchdog thread that detects a stub which stops reporting
> > >    back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> > >    recover via the existing teardown.
> >
> > That also seems like an odd solution to me. Architecturally, UML first
> > receives the SIGALRM and forwards it to the child. It would seem much
> > easier to set a flag and clear it again when the process reports back
> > that it received the SIGALRM. Then, when the kernel receives the next
> > SIGALRM, just kill the child immediately if the flag is still set.
>
> The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
> the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
> signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
> child" is accurate there. This patch is on the seccomp path
> (wait_stub_done_seccomp), which is architecturally different.
>
> Or am I missing anything?

The fact that it is not actually different from an architectural
standpoint. In both cases SIGALRM is reported back to the kernel. It
obviously has to be reported back, otherwise scheduling would not work.
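
To make the flag idea concrete, here is a rough sketch of the
bookkeeping only. All names (struct stub_tick_state, on_monitor_sigalrm,
on_stub_reported_sigalrm) are made up for illustration and do not exist
in the tree, and the actual signal forwarding and kill() of the child
are left out, so treat it as an assumption about how this could look
rather than a patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-stub state; not a structure from the UML sources. */
struct stub_tick_state {
	bool alarm_pending;	/* SIGALRM forwarded, stub has not reported back */
};

/* What the monitor should do with the child on this tick. */
enum tick_action { TICK_FORWARD, TICK_KILL };

/* Called when the monitor (UML kernel) itself receives SIGALRM. */
static enum tick_action on_monitor_sigalrm(struct stub_tick_state *s)
{
	assert(s != NULL);
	if (s->alarm_pending)
		return TICK_KILL;	/* previous tick was never acknowledged */
	s->alarm_pending = true;	/* set the flag, then forward the signal */
	return TICK_FORWARD;
}

/* Called when the stub reports back that it received the SIGALRM. */
static void on_stub_reported_sigalrm(struct stub_tick_state *s)
{
	assert(s != NULL);
	s->alarm_pending = false;
}
```

With this shape, a stub that blocks SIGALRM survives for at most one
timer period before the next tick returns TICK_KILL, and no extra
thread is needed.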

Benjamin

>
> Thanks for your review.
>
> Regards,
> Cong
>