[PATCH 0/7] um: skas: harden the seccomp userspace stub

Tue Jun 23 15:08:36 PDT 2026

On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
>
> On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > From: Cong Wang <cwang at multikernel.io>
> >
> > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > list could not be addressed by the filter alone:
> >
> >   - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> >     intra-guest disclosure and, on this base (single physmem fd, no
> >     kernel/user split), a host escape; and
> >
> >   - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> >     evade preemption and wedge the monitor indefinitely.
> >
> > This series closes both:
> >
> >   1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> >        owned by the monitor (no behavioural change yet).
> >   3-4: validate each mmap() against the mm's page table -- allowed iff the
> >        PTE already maps the requested frame with no more access than it
> >        grants -- including out-of-batch mmaps a hijacked stub issues on
> >        its own.
> >   5:   route and validate munmap() the same way (range-confined below
> >        STUB_START).
>
>
> That approach seems odd to me. Adding an explicit out-of-band check
> means you require two extra context switches per mmap syscall. I would
> expect that this makes the SECCOMP approach a lot slower than ptrace().
> My take is still that it is possible to carefully craft a SECCOMP
> filter as well as stub/kernel code that makes exploitation impossible
> for non-SMP.
>
> The true SMP case is more complicated, but we do not have that anyway,
> so I would not worry about it for now.
>
> Did you run any performance tests?

Sorry, I meant to include them in patch 4/7 but left them out.

Here they are:

mmap() -- ns/call
-----------------
size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
4K     13056      10368         +2688  (+26%)     17408      15104
16K    13824      11008         +2816  (+26%)     15616      12544
64K    19840      15616         +4224  (+27%)     21760      17408
256K   33152      28928         +4224  (+15%)     36352      31744
1M     95616      84608         +11008 (+13%)     117504     108032

munmap() -- ns/call
-------------------
size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
4K     11008      8448          +2560  (+30%)     13568      11264
16K    10752      8448          +2304  (+27%)     12288      9472
64K    12032      9728          +2304  (+24%)     13824      10752
256K   15360      12800         +2560  (+20%)     17152      14592
1M     27648      24832         +2816  (+11%)     30976      26880

Since this is a clear security-vs-performance trade-off, I picked security
over performance. Please let me know if you prefer otherwise.
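
For anyone who wants to collect numbers of the same shape (median and p99
in ns/call), a minimal sketch of a per-call timing loop follows. It is an
illustration only and not necessarily the harness behind the tables above.
The helper names (now_ns, percentile_ns, bench_map), the use of anonymous
private mappings and the nearest-rank percentile are all assumptions. It
would be run as an ordinary process inside the guest.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static int cmp_u64(const void *a, const void *b)
{
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

        return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile (pct in 1..100) of n samples.
 * Sorts the sample array in place. pct == 50 gives the median.
 */
static uint64_t percentile_ns(uint64_t *s, size_t n, unsigned int pct)
{
        size_t rank;

        assert(n > 0 && pct >= 1 && pct <= 100);
        qsort(s, n, sizeof(*s), cmp_u64);
        rank = (n * pct + 99) / 100;    /* ceil(n * pct / 100) */
        return s[rank - 1];
}

/*
 * Time n mmap()/munmap() pairs of len bytes each (assumed mapping type:
 * anonymous, private). Per-call durations go to map_ns[] and unmap_ns[].
 * Returns 0 on success, -1 if either syscall fails.
 */
static int bench_map(size_t len, size_t n, uint64_t *map_ns,
                     uint64_t *unmap_ns)
{
        for (size_t i = 0; i < n; i++) {
                uint64_t t0 = now_ns();
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                uint64_t t1 = now_ns();

                if (p == MAP_FAILED)
                        return -1;
                if (munmap(p, len))
                        return -1;

                uint64_t t2 = now_ns();

                map_ns[i] = t1 - t0;
                unmap_ns[i] = t2 - t1;
        }
        return 0;
}
```

Calling bench_map() once per size and then percentile_ns(samples, n, 50)
and percentile_ns(samples, n, 99) on each array yields one table row. The
delta column is then the difference between two such runs on kernels with
and without the series.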

On second thought, there is also a zero-overhead solution: pidfd_mmap(),
which can be built on top of another patch of mine:
https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/

Something like:

int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
        int phys_fd, unsigned long long offset)
{
        return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
                          prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
                          phys_fd, offset);
}

int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
{
        return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
}

Obviously, this requires adding a new syscall to the host kernel first.

>
> >   6:   add a watchdog thread that detects a stub which stops reporting
> >        back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> >        recover via the existing teardown.
>
> That also seems like an odd solution to me. Architecturally, UML first
> receives the SIGALRM and forwards it to the child. It would seem much
> easier to set a flag and clear it again when the process reports back
> that it received the SIGALRM. Then, when the kernel receives the next
> SIGALRM, just kill the child immediately if the flag is still set.

The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
child" is accurate there. This patch is on the seccomp path
(wait_stub_done_seccomp), which is architecturally different.
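
To make sure we are talking about the same thing, the flag-and-recheck
scheme under discussion can be modelled as a small state machine. The
sketch below is a standalone model with hypothetical names (stub_state,
monitor_on_alarm, monitor_on_stub_report), not code from either path. The
comments mark where a real monitor would forward SIGALRM or send SIGKILL.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-stub bookkeeping kept by the monitor. */
struct stub_state {
        bool alarm_pending;     /* SIGALRM forwarded, not yet reported back */
        bool killed;            /* stub has been torn down */
};

/* The monitor received a SIGALRM tick for this stub. */
static void monitor_on_alarm(struct stub_state *s)
{
        if (s->killed)
                return;

        if (s->alarm_pending) {
                /* Previous tick was never acknowledged: give up on it. */
                s->killed = true;       /* real code: SIGKILL the stub */
                return;
        }

        s->alarm_pending = true;        /* real code: forward SIGALRM */
}

/* The stub reported back that it took the SIGALRM. */
static void monitor_on_stub_report(struct stub_state *s)
{
        s->alarm_pending = false;
}
```

In this model a well-behaved stub survives any number of ticks, while a
stub that never reports back is killed on the second tick.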

Or am I missing something?

Thanks for your review.

Regards,
Cong