[PATCH 0/7] um: skas: harden the seccomp userspace stub
Cong Wang
xiyou.wangcong at gmail.com
Wed Jul 1 14:31:34 PDT 2026
On Sat, Jun 27, 2026 at 9:21 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
>
> Hi,
>
> On Wed, 2026-06-24 at 16:12 -0700, Cong Wang wrote:
> > On Wed, Jun 24, 2026 at 12:54 AM Benjamin Berg
> > <benjamin at sipsolutions.net> wrote:
> > >
> > > On Tue, 2026-06-23 at 15:08 -0700, Cong Wang wrote:
> > > > On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
> > > > >
> > > > > On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > > > > > From: Cong Wang <cwang at multikernel.io>
> > > > > >
> > > > > > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > > > > > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > > > > > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > > > > > list could not be addressed by the filter alone:
> > > > > >
> > > > > > - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> > > > > > intra-guest disclosure and, on this base (single physmem fd, no
> > > > > > kernel/user split), a host escape; and
> > > > > >
> > > > > > - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> > > > > > evade preemption and wedge the monitor indefinitely.
> > > > > >
> > > > > > This series closes both:
> > > > > >
> > > > > > 1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> > > > > > owned by the monitor (no behavioural change yet).
> > > > > > 3-4: validate each mmap() against the mm's page table -- allowed iff the
> > > > > > PTE already maps the requested frame with no more access than it
> > > > > > grants -- including out-of-batch mmaps a hijacked stub issues on
> > > > > > its own.
> > > > > > 5: route and validate munmap() the same way (range-confined below
> > > > > > STUB_START).
> > > > >
> > > > >
> > > > > That approach seems odd to me. Adding an explicit out-of-band check
> > > > > means you require two extra context switches per mmap syscall. I would
> > > > > expect that this makes the SECCOMP approach a lot slower than ptrace().
> > > > > My take is still that it is possible to carefully craft a SECCOMP
> > > > > filter as well as stub/kernel code that makes exploitation impossible
> > > > > for non-SMP.
> > > > >
> > > > > The true SMP case is more complicated, but we do not have that anyway,
> > > > > so I would not worry about it for now.
> > > > >
> > > > > Did you run any performance tests?
> > > >
> > > > Sorry, I thought I included them in patch 4/7, but I missed them.
> > > >
> > > > Here they are:
> > > >
> > > > mmap() -- ns/call
> > > > -----------------
> > > > size  WITH med  WITHOUT med  delta           WITH p99  WITHOUT p99
> > > > 4K       13056        10368  +2688 (+26%)       17408        15104
> > > > 16K      13824        11008  +2816 (+26%)       15616        12544
> > > > 64K      19840        15616  +4224 (+27%)       21760        17408
> > > > 256K     33152        28928  +4224 (+15%)       36352        31744
> > > > 1M       95616        84608  +11008 (+13%)     117504       108032
> > > >
> > > > munmap() -- ns/call
> > > > -------------------
> > > > size  WITH med  WITHOUT med  delta           WITH p99  WITHOUT p99
> > > > 4K       11008         8448  +2560 (+30%)       13568        11264
> > > > 16K      10752         8448  +2304 (+27%)       12288         9472
> > > > 64K      12032         9728  +2304 (+24%)       13824        10752
> > > > 256K     15360        12800  +2560 (+20%)       17152        14592
> > > > 1M       27648        24832  +2816 (+11%)       30976        26880
> > > >
> > > > Since this is a clear security-vs-performance balance, I picked security
> > > > over performance. Please let me know if you prefer otherwise.
> > >
> > > So, I still believe that (almost) zero-overhead is possible in the
> > > current framework. Maybe one could add a non-zero overhead option, but
> > > then I would like to be able to disable it as there are plenty of
> > > scenarios where security is of no concern.
> >
> > This is fair, I can add an option to keep it disabled by default.
>
> Sure. But I still think that a discussion is needed on whether another
> option is viable. I would really prefer not adding your proposed
> solution if another low/zero overhead option is viable.
>
> And I still absolutely believe that it is viable to modify the current
> code and make it secure. I just never finished that work because it
> requires a lot of care and it is not that important to us.
OK. Just to provide more context: I am evaluating UML for multi-tenant
cloud use, where security is a must, not an option.
I guess I might be the first one doing so; otherwise these security
issues would have been resolved already. :)
>
> > > Also, I am unsure what you measured there. When you write "mmap" here,
> > > is that an "mmap" call made by the userspace process (which might not
> > > actually change the page mappings), or is that an "mmap" call that was
> > > queued for the stub to process?
> > >
> > > A large amount of the overhead that UML tends to have happens when
> > > handling both minor and major page faults. Simple tests for this are, for
> > > example, a relatively tight fork/exec loop or just applications starting
> > > up. In our test runs, we saw major speed improvements (>10%) just by
> > > fixing bugs in the TLB handling code that resulted in fewer page faults
> > > happening at application runtime.
> >
> > It is a simple user-space mmap() loop, written only for benchmarking this
> > patchset.
>
> If you just say "mmap() loop" I think that you are purely doing an
> mmap/unmap syscall of anonymous memory in a loop. However, the kernel
> may not map any of those pages until they are actually accessed. I
> would not have been surprised if you did not see any performance
> difference at all in such a test.
>
> Are you actually touching (writing) each of the mapped pages?
Yes, with MAP_POPULATE. Sorry for forgetting to mention it.
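A minimal sketch of such a loop, assuming anonymous private mappings and
CLOCK_MONOTONIC timing. The helper names (now_ns, cmp_u64, bench_mmap) and
the iteration count are illustrative only and are not taken from the actual
benchmark program:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds (hypothetical helper). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* qsort() comparator for uint64_t samples. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Time `iters` mmap()/munmap() pairs of `len` bytes each and store the
 * median per-call latencies in ns. MAP_POPULATE makes the kernel fault the
 * pages in at mmap() time, so the page table updates are part of the
 * measured call. Returns 0 on success, -1 on failure.
 */
static int bench_mmap(size_t len, size_t iters,
		      uint64_t *mmap_med, uint64_t *munmap_med)
{
	uint64_t *m, *u;
	int ret = -1;

	assert(mmap_med && munmap_med);
	if (iters == 0)
		return -1;

	m = calloc(iters, sizeof(*m));
	u = calloc(iters, sizeof(*u));
	if (!m || !u)
		goto out;

	for (size_t i = 0; i < iters; i++) {
		uint64_t t0, t1, t2;
		void *p;

		t0 = now_ns();
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
		t1 = now_ns();
		if (p == MAP_FAILED)
			goto out;
		if (munmap(p, len))
			goto out;
		t2 = now_ns();

		m[i] = t1 - t0;
		u[i] = t2 - t1;
	}

	qsort(m, iters, sizeof(*m), cmp_u64);
	qsort(u, iters, sizeof(*u), cmp_u64);
	*mmap_med = m[iters / 2];
	*munmap_med = u[iters / 2];
	ret = 0;
out:
	free(m);
	free(u);
	return ret;
}
```

Calling bench_mmap() inside the guest with len set to 4K, 16K, 64K, 256K
and 1M would give one row each of the kind shown in the tables above.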
>
> > > > Meanwhile, after a second thought, there is a zero-overhead solution.
> > > > pidfd_mmap(), which can be built on top of another patch of mine:
> > > > https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/
> > >
> > > Sure, should some sort of method land that allows directly modifying
> > > the mappings of the remote process, that would be rather convenient.
> > >
> >
> > Great. I will work on it. One question: should I drop this one, or can we use
> > this one as the short-term solution until the pidfd_mmap() solution gets in?
> > Even if we had pidfd_mmap(), UML might still want to keep compatibility with
> > older host kernels?
>
> Of course, UML should not regress on older kernels. Also, any new API
> like that will probably still take a while. That said, I am not sure
> there is a pressing need to fix the security problem as long as we have
> ptrace mode available.
Maybe I am greedy: I want both security and performance. I ruled out
the ptrace mode because of its low performance and am trying to improve
seccomp mode for better security.
Anyway, I agree to drop this patchset and work on pidfd_mmap();
I will send those patches out shortly.
>
> >
> > > > Something like:
> > > >
> > > > int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
> > > >         int phys_fd, unsigned long long offset)
> > > > {
> > > >         return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
> > > >                           prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
> > > >                           phys_fd, offset);
> > > > }
> > > >
> > > > int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
> > > > {
> > > >         return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
> > > > }
> > > >
> > > > Obviously, this one requires a new syscall for the host kernel first.
> > > >
> > > > >
> > > > > > 6: add a watchdog thread that detects a stub which stops reporting
> > > > > > back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> > > > > > recover via the existing teardown.
> > > > >
> > > > > That also seems like an odd solution to me. Architecturally, UML first
> > > > > receives the SIGALRM and forwards it to the child. It would seem much
> > > > > easier to set a flag and clear it again when the process reports back
> > > > > that it received the SIGALRM. Then, when the kernel receives the next
> > > > > SIGALRM, just kill the child immediately if the flag is still set.
> > > >
> > > > The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
> > > > the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
> > > > signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
> > > > child" is accurate there. This patch is on the seccomp path
> > > > (wait_stub_done_seccomp), which is architecturally different.
> > > >
> > > > Or am I missing anything?
> > >
> > > The fact that it is not actually different from an architectural
> > > standpoint. In both cases SIGALRM is reported back to the kernel. It
> > > obviously has to be reported back, otherwise scheduling would not work.
> >
> > You're right.
> >
> > In the seccomp path the monitor also receives the SIGALRM and
> > forwards it to the stub: the per-vCPU POSIX timer targets the monitor thread
> > and um_timer() -> os_alarm_process() -> kill(stub, SIGALRM) forwards
> > the tick. ptrace vs. seccomp differ only in how the monitor
> > learns the stub stopped (waitpid vs. futex), not in the point you raised.
> >
> > So the watchdog thread is redundant. The inner FUTEX_WAIT loop in
> > wait_stub_done_seccomp already wakes on every forwarded tick via EINTR;
> > a responsive stub acks by flipping data->futex out of FUTEX_IN_CHILD, a
> > SIGALRM-blocking one doesn't, exactly your flag scheme. I'll drop the helper
> > thread and timerfd and detect the stall inline: count consecutive ticks where
> > the stub didn't report and goto out_kill past a small threshold.
>
> Not sure I follow and the description makes me wonder if it was written
> by an LLM. To me it seems overly complicated and incorrect in subtle
> ways.
Yes, I marked it with Assisted-by: Claude:claude-opus-4.8.
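For what it is worth, the flag scheme you described earlier in the thread
reads to me as something like the following. This is only a minimal sketch:
struct stub_state, monitor_on_tick() and monitor_on_stub_report() are
hypothetical names, not existing UML functions, and the real kill() calls
are left as comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-stub bookkeeping kept by the monitor. */
struct stub_state {
	bool alarm_pending;	/* tick forwarded, stub has not reported back */
	bool killed;		/* monitor decided to kill the stub */
};

/*
 * Called when the monitor itself receives SIGALRM.
 * Returns true if the stub has to be killed because it never reported
 * the previous tick, false if the tick was simply forwarded.
 */
static bool monitor_on_tick(struct stub_state *s)
{
	assert(s != NULL);

	if (s->alarm_pending) {
		/* Previous tick is still unacknowledged: give up on the stub. */
		s->killed = true;	/* real code: kill(stub_pid, SIGKILL) */
		return true;
	}

	s->alarm_pending = true;	/* real code: kill(stub_pid, SIGALRM) */
	return false;
}

/* Called when the stub reports back that it took the SIGALRM. */
static void monitor_on_stub_report(struct stub_state *s)
{
	assert(s != NULL);
	s->alarm_pending = false;
}
```

A stub that blocks SIGALRM never reaches monitor_on_stub_report(), so the
second tick in a row kills it; a responsive stub clears the flag in between.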
Thanks.