[PATCH 0/7] um: skas: harden the seccomp userspace stub
Cong Wang
xiyou.wangcong at gmail.com
Tue Jun 23 15:08:36 PDT 2026
On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
>
> On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > From: Cong Wang <cwang at multikernel.io>
> >
> > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > list could not be addressed by the filter alone:
> >
> > - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> >   intra-guest disclosure and, on this base (single physmem fd, no
> >   kernel/user split), a host escape; and
> >
> > - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> >   evade preemption and wedge the monitor indefinitely.
> >
> > This series closes both:
> >
> > 1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> >      owned by the monitor (no behavioural change yet).
> > 3-4: validate each mmap() against the mm's page table -- allowed iff the
> >      PTE already maps the requested frame with no more access than it
> >      grants -- including out-of-batch mmaps a hijacked stub issues on
> >      its own.
> > 5: route and validate munmap() the same way (range-confined below
> >    STUB_START).
>
>
> That approach seems odd to me. Adding an explicit out-of-band check
> means you require two extra context switches per mmap syscall. I would
> expect that this makes the SECCOMP approach a lot slower than ptrace().
> My take is still that it is possible to carefully craft a SECCOMP
> filter as well as stub/kernel code that makes exploitation impossible
> for non-SMP.
>
> The true SMP case is more complicated, but we do not have that anyway,
> so I would not worry about it for now.
>
> Did you run any performance tests?
Sorry, I meant to include them in patch 4/7 but left them out.
Here they are:
mmap() -- ns/call
-----------------
size  WITH med  WITHOUT med  delta          WITH p99  WITHOUT p99
4K       13056        10368   +2688 (+26%)     17408        15104
16K      13824        11008   +2816 (+26%)     15616        12544
64K      19840        15616   +4224 (+27%)     21760        17408
256K     33152        28928   +4224 (+15%)     36352        31744
1M       95616        84608  +11008 (+13%)    117504       108032

munmap() -- ns/call
-------------------
size  WITH med  WITHOUT med  delta          WITH p99  WITHOUT p99
4K       11008         8448   +2560 (+30%)     13568        11264
16K      10752         8448   +2304 (+27%)     12288         9472
64K      12032         9728   +2304 (+24%)     13824        10752
256K     15360        12800   +2560 (+20%)     17152        14592
1M       27648        24832   +2816 (+11%)     30976        26880
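For anyone who wants to collect comparable per-call numbers, a median/p99
measurement can be put together along the following lines. This is only a
generic sketch and should not be read as the harness that produced the
tables: the helper names (now_ns, percentile, bench) are made up for
illustration, it times anonymous private mappings rather than anything
backed by physmem, and the nearest-rank percentile is an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Nearest-rank percentile over a sorted array; pct in 1..100. */
static uint64_t percentile(const uint64_t *sorted, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct >= 1 && pct <= 100);
	rank = (n * pct + 99) / 100;	/* ceil(n * pct / 100) */
	return sorted[rank - 1];
}

/*
 * Time `iters` mmap()/munmap() pairs of `len` bytes each.
 * On success the two output arrays hold per-call latencies, sorted
 * ascending, and 0 is returned; -1 is returned if a call fails.
 */
static int bench(size_t len, size_t iters, uint64_t *map_ns, uint64_t *unmap_ns)
{
	for (size_t i = 0; i < iters; i++) {
		uint64_t t0 = now_ns();
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		uint64_t t1 = now_ns();

		if (p == MAP_FAILED)
			return -1;

		uint64_t t2 = now_ns();
		int ret = munmap(p, len);
		uint64_t t3 = now_ns();

		if (ret)
			return -1;

		map_ns[i] = t1 - t0;
		unmap_ns[i] = t3 - t2;
	}

	qsort(map_ns, iters, sizeof(*map_ns), cmp_u64);
	qsort(unmap_ns, iters, sizeof(*unmap_ns), cmp_u64);
	return 0;
}
```

After bench() returns, percentile(map_ns, iters, 50) and
percentile(map_ns, iters, 99) give the med and p99 columns for one size;
running the same binary in the guest with and without the series applied
would give the WITH and WITHOUT sides.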
Since this is a clear security-vs-performance trade-off, I picked security
over performance. Please let me know if you prefer otherwise.
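For reference, what the extra round trip buys is the rule from the cover
letter: an mmap() is allowed iff the PTE already maps the requested frame
with no more access than it grants. Reduced to a standalone check, that is
roughly the following. It is a simplified sketch and not the code in
patches 3-4: struct pte_view and mmap_request_allowed() are hypothetical
names, and treating a non-present PTE as "deny" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical, simplified view of one page-table entry. */
struct pte_view {
	bool present;		/* PTE currently maps something */
	uint64_t frame;		/* physmem page frame it maps */
	int prot;		/* PROT_* bits it grants */
};

/*
 * Allow the request only if the PTE already maps the same frame and the
 * requested protection is a subset of what the PTE grants.
 */
static bool mmap_request_allowed(const struct pte_view *pte,
				 uint64_t req_frame, int req_prot)
{
	assert(pte != NULL);

	if (!pte->present)
		return false;		/* assumption: nothing mapped -> deny */
	if (pte->frame != req_frame)
		return false;		/* different physmem offset */
	return (req_prot & ~pte->prot) == 0;	/* no extra access bits */
}
```

With a read-only PTE for frame 42, a PROT_READ request for frame 42 passes,
while PROT_READ | PROT_WRITE for the same frame, or any request for frame
43, is rejected.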
Meanwhile, on second thought, there is a zero-overhead alternative: a
pidfd_mmap() syscall, which can be built on top of another patch of mine:
https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/
Something like:
int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
	int phys_fd, unsigned long long offset)
{
	return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
			  prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
			  phys_fd, offset);
}

int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
{
	return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
}
Obviously, this one requires a new syscall in the host kernel first.
>
> > 6: add a watchdog thread that detects a stub which stops reporting
> >    back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> >    recover via the existing teardown.
>
> That also seems like an odd solution to me. Architecturally, UML first
> receives the SIGALRM and forwards it to the child. It would seem much
> easier to set a flag and clear it again when the process reports back
> that it received the SIGALRM. Then, when the kernel receives the next
> SIGALRM, just kill the child immediately if the flag is still set.
The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
child" is accurate there. This patch is on the seccomp path
(wait_stub_done_seccomp), which is architecturally different.
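To make sure we are talking about the same thing, this is the
flag-and-recheck scheme as I read the suggestion, reduced to a standalone
sketch. The struct and function names are made up for illustration and are
not in the tree, and the SIGKILL is modelled as a plain flag so the logic
can be followed without any signal plumbing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-stub state; not a structure from the UML tree. */
struct stub_state {
	bool alarm_pending;	/* SIGALRM forwarded, not yet acknowledged */
	bool killed;		/* stands in for kill(pid, SIGKILL) */
};

/*
 * Monitor received a timer tick. If the previous tick was never
 * acknowledged by the stub, kill it; otherwise forward the tick and set
 * the flag. Returns true if the stub is (now) dead.
 */
static bool monitor_on_alarm(struct stub_state *s)
{
	assert(s != NULL);

	if (s->killed)
		return true;
	if (s->alarm_pending) {
		s->killed = true;	/* real code would SIGKILL the stub */
		return true;
	}
	s->alarm_pending = true;	/* real code would forward SIGALRM */
	return false;
}

/* Stub reported back that it took the SIGALRM: clear the flag. */
static void monitor_on_stub_report(struct stub_state *s)
{
	assert(s != NULL);
	s->alarm_pending = false;
}
```

A stub that acknowledges every tick survives indefinitely; one that misses
an acknowledgement is killed on the following tick.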
Or am I missing anything?
Thanks for your review.
Regards,
Cong