[PATCH 0/7] um: skas: harden the seccomp userspace stub
Benjamin Berg
benjamin at sipsolutions.net
Sat Jun 27 09:21:41 PDT 2026
Hi,
On Wed, 2026-06-24 at 16:12 -0700, Cong Wang wrote:
> On Wed, Jun 24, 2026 at 12:54 AM Benjamin Berg
> <benjamin at sipsolutions.net> wrote:
> >
> > On Tue, 2026-06-23 at 15:08 -0700, Cong Wang wrote:
> > > On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
> > > >
> > > > On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > > > > From: Cong Wang <cwang at multikernel.io>
> > > > >
> > > > > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > > > > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > > > > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > > > > list could not be addressed by the filter alone:
> > > > >
> > > > > - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> > > > >   intra-guest disclosure and, on this base (single physmem fd, no
> > > > >   kernel/user split), a host escape; and
> > > > >
> > > > > - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> > > > >   evade preemption and wedge the monitor indefinitely.
> > > > >
> > > > > This series closes both:
> > > > >
> > > > > 1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> > > > >      owned by the monitor (no behavioural change yet).
> > > > > 3-4: validate each mmap() against the mm's page table -- allowed iff the
> > > > >      PTE already maps the requested frame with no more access than it
> > > > >      grants -- including out-of-batch mmaps a hijacked stub issues on
> > > > >      its own.
> > > > > 5:   route and validate munmap() the same way (range-confined below
> > > > >      STUB_START).
> > > >
> > > >
> > > > That approach seems odd to me. Adding an explicit out-of-band check
> > > > means you require two extra context switches per mmap syscall. I would
> > > > expect that this makes the SECCOMP approach a lot slower than ptrace().
> > > > My take is still that it is possible to carefully craft a SECCOMP
> > > > filter as well as stub/kernel code that makes exploitation impossible
> > > > for non-SMP.
> > > >
> > > > The true SMP case is more complicated, but we do not have that anyway,
> > > > so I would not worry about it for now.
> > > >
> > > > Did you run any performance tests?
> > >
> > > Sorry, I thought I included them in patch 4/7, but I missed them.
> > >
> > > Here they are:
> > >
> > > mmap() -- ns/call
> > > -----------------
> > > size    WITH med  WITHOUT med  delta           WITH p99  WITHOUT p99
> > > 4K         13056        10368  +2688 (+26%)       17408        15104
> > > 16K        13824        11008  +2816 (+26%)       15616        12544
> > > 64K        19840        15616  +4224 (+27%)       21760        17408
> > > 256K       33152        28928  +4224 (+15%)       36352        31744
> > > 1M         95616        84608  +11008 (+13%)     117504       108032
> > >
> > > munmap() -- ns/call
> > > -------------------
> > > size    WITH med  WITHOUT med  delta           WITH p99  WITHOUT p99
> > > 4K         11008         8448  +2560 (+30%)       13568        11264
> > > 16K        10752         8448  +2304 (+27%)       12288         9472
> > > 64K        12032         9728  +2304 (+24%)       13824        10752
> > > 256K       15360        12800  +2560 (+20%)       17152        14592
> > > 1M         27648        24832  +2816 (+11%)       30976        26880
> > >
> > > Since this is a clear security-vs-performance balance, I picked security
> > > over performance. Please let me know if you prefer otherwise.
> >
> > So, I still believe that (almost) zero-overhead is possible in the
> > current framework. Maybe one could add a non-zero overhead option, but
> > then I would like to be able to disable it as there are plenty of
> > scenarios where security is of no concern.
>
> This is fair, I can add an option to keep it disabled by default.

Sure. But I still think that a discussion is needed on whether another
option is viable. I would really prefer not to add your proposed
solution if another low/zero-overhead option is viable.

And I still absolutely believe that it is viable to modify the current
code and make it secure. I just never finished that work because it
requires a lot of care and it is not that important to us.

> > Also, I am unsure what you measured there. When you write "mmap" here,
> > is that an "mmap" call made by the userspace process (which might not
> > actually change the page mappings), or is that an "mmap" call that was
> > queued for the stub to process?
> >
> > A large amount of the overhead that UML tends to have happens when
> > handling both minor and major page faults. Simple tests for this are, for
> > example, a relatively tight fork/exec loop or just applications starting
> > up. In our test runs, we saw major speed improvements (>10%) just by
> > fixing bugs in the TLB handling code that resulted in fewer page faults
> > happening at application runtime.
>
> It is a simple user-space mmap() loop, written only for benchmarking this
> patchset.

If you just say "mmap() loop", I assume that you are purely doing
mmap/munmap syscalls on anonymous memory in a loop. However, the kernel
may not map any of those pages until they are actually accessed. I
would not have been surprised if you did not see any performance
difference at all in such a test.

Are you actually touching (writing) each of the mapped pages?

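For illustration, a minimal sketch of such a test in C, assuming plain
anonymous private mappings. The helper names (now_ns, map_touch_unmap) are
made up for this example and nothing here is taken from the benchmark that
produced the numbers above. With touch set, one byte is written per page so
that each page is actually faulted in before the munmap().

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * mmap() len bytes of anonymous memory, optionally write one byte per
 * page, then munmap() again. Stores the elapsed time in *elapsed_ns.
 * Returns the number of pages written, or -1 on failure.
 * (Hypothetical helper, for illustration only.)
 */
static long map_touch_unmap(size_t len, int touch, uint64_t *elapsed_ns)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	uint64_t start = now_ns();
	volatile unsigned char *p;
	long touched = 0;
	size_t off;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	if (touch) {
		/* One write per page forces a page fault for each page. */
		for (off = 0; off < len; off += page) {
			p[off] = 1;
			touched++;
		}
	}

	if (munmap((void *)p, len) != 0)
		return -1;

	*elapsed_ns = now_ns() - start;
	return touched;
}
```

Comparing the touch and no-touch timings for the same size would show how
much of the cost comes from the page faults rather than from the syscalls
themselves.
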
> > > Meanwhile, after a second thought, there is a zero-overhead solution.
> > > pidfd_mmap(), which can be built on top of another patch of mine:
> > > https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/
> >
> > Sure, should some sort of method land that allows directly modifying
> > the mappings of the remote process, that would be rather convenient.
> >
>
> Great. I will work on it. One question: should I drop this one, or can we use
> it as a short-term solution until the pidfd_mmap() solution gets in?
> Even if we had pidfd_mmap(), UML might still want to keep compatibility with
> older host kernels?

Of course, UML should not regress on older kernels. Also, any new API
like that will probably still take a while. That said, I am not sure
there is a pressing need to fix the security problem as long as we have
ptrace mode available.

>
> > > Something like:
> > >
> > > int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
> > >         int phys_fd, unsigned long long offset)
> > > {
> > >         return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
> > >                           prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
> > >                           phys_fd, offset);
> > > }
> > >
> > > int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
> > > {
> > >         return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
> > > }
> > >
> > > Obviously, this one requires a new syscall for the host kernel first.
> > >
> > > >
> > > > > 6:   add a watchdog thread that detects a stub which stops reporting
> > > > >      back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> > > > >      recover via the existing teardown.
> > > >
> > > > That also seems like an odd solution to me. Architecturally, UML first
> > > > receives the SIGALRM and forwards it to the child. It would seem much
> > > > easier to set a flag and clear it again when the process reports back
> > > > that it received the SIGALRM. Then, when the kernel receives the next
> > > > SIGALRM, just kill the child immediately if the flag is still set.
> > >
> > > The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
> > > the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
> > > signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
> > > child" is accurate there. This patch is on the seccomp path
> > > (wait_stub_done_seccomp), which is architecturally different.
> > >
> > > Or am I missing anything?
> >
> > The fact that it is not actually different from an architectural
> > standpoint. In both cases SIGALRM is reported back to the kernel. It
> > obviously has to be reported back, otherwise scheduling would not work.
>
> You're right.
>
> In the seccomp path the monitor also receives the SIGALRM and
> forwards it to the stub: the per-vCPU POSIX timer targets the monitor thread
> and um_timer() -> os_alarm_process() -> kill(stub, SIGALRM) forwards
> the tick. ptrace vs. seccomp differ only in how the monitor
> learns the stub stopped (waitpid vs. futex), not in the point you raised.
>
> So the watchdog thread is redundant. The inner FUTEX_WAIT loop in
> wait_stub_done_seccomp already wakes on every forwarded tick via EINTR;
> a responsive stub acks by flipping data->futex out of FUTEX_IN_CHILD, a
> SIGALRM-blocking one doesn't, exactly your flag scheme. I'll drop the helper
> thread and timerfd and detect the stall inline: count consecutive ticks where
> the stub didn't report and goto out_kill past a small threshold.

Not sure I follow and the description makes me wonder if it was written
by an LLM. To me it seems overly complicated and incorrect in subtle
ways.

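For reference, the flag scheme described earlier in this thread could be
sketched roughly as follows. This is only an illustration with made-up
names (stub_tick_state, signal_for_tick, on_stub_reported); it is not code
from the UML tree and it ignores how the report from the stub is actually
delivered.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical per-stub state; not an actual UML data structure. */
struct stub_tick_state {
	bool alarm_pending;	/* tick forwarded, stub has not reported back */
};

/*
 * Called when the monitor receives SIGALRM. Returns the signal that
 * should be sent to the stub: SIGALRM to forward the tick, or SIGKILL
 * if the previous tick was never acknowledged.
 */
static int signal_for_tick(struct stub_tick_state *s)
{
	if (s->alarm_pending)
		return SIGKILL;

	s->alarm_pending = true;
	return SIGALRM;
}

/* Called when the stub reports back that it received the SIGALRM. */
static void on_stub_reported(struct stub_tick_state *s)
{
	s->alarm_pending = false;
}
```

The SIGALRM path in the monitor would then do something along the lines of
kill(stub_pid, signal_for_tick(&state)), with stub_pid standing in for
however the stub is identified, and call on_stub_reported() wherever the
report from the stub is processed.
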
Benjamin