[PATCH 0/7] um: skas: harden the seccomp userspace stub

Wed Jul 1 14:31:34 PDT 2026

On Sat, Jun 27, 2026 at 9:21 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
>
> Hi,
>
> On Wed, 2026-06-24 at 16:12 -0700, Cong Wang wrote:
> > On Wed, Jun 24, 2026 at 12:54 AM Benjamin Berg
> > <benjamin at sipsolutions.net> wrote:
> > >
> > > On Tue, 2026-06-23 at 15:08 -0700, Cong Wang wrote:
> > > > On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
> > > > >
> > > > > On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > > > > > From: Cong Wang <cwang at multikernel.io>
> > > > > >
> > > > > > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > > > > > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > > > > > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > > > > > list could not be addressed by the filter alone:
> > > > > >
> > > > > >   - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> > > > > >     intra-guest disclosure and, on this base (single physmem fd, no
> > > > > >     kernel/user split), a host escape; and
> > > > > >
> > > > > >   - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> > > > > >     evade preemption and wedge the monitor indefinitely.
> > > > > >
> > > > > > This series closes both:
> > > > > >
> > > > > >   1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> > > > > >        owned by the monitor (no behavioural change yet).
> > > > > >   3-4: validate each mmap() against the mm's page table -- allowed iff the
> > > > > >        PTE already maps the requested frame with no more access than it
> > > > > >        grants -- including out-of-batch mmaps a hijacked stub issues on
> > > > > >        its own.
> > > > > >   5:   route and validate munmap() the same way (range-confined below
> > > > > >        STUB_START).
> > > > >
> > > > >
> > > > > That approach seems odd to me. Adding an explicit out-of-band check
> > > > > means you require two extra context switches per mmap syscall. I would
> > > > > expect that this makes the SECCOMP approach a lot slower than ptrace().
> > > > > My take is still that it is possible to carefully craft a SECCOMP
> > > > > filter as well as stub/kernel code that makes exploitation impossible
> > > > > for non-SMP.
> > > > >
> > > > > The true SMP case is more complicated, but we do not have that anyway,
> > > > > so I would not worry about it for now.
> > > > >
> > > > > Did you run any performance tests?
> > > >
> > > > Sorry, I thought I included them in patch 4/7, but I missed them.
> > > >
> > > > Here they are:
> > > >
> > > > mmap() -- ns/call
> > > > -----------------
> > > > size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
> > > > 4K     13056      10368         +2688  (+26%)     17408      15104
> > > > 16K    13824      11008         +2816  (+26%)     15616      12544
> > > > 64K    19840      15616         +4224  (+27%)     21760      17408
> > > > 256K   33152      28928         +4224  (+15%)     36352      31744
> > > > 1M     95616      84608         +11008 (+13%)     117504     108032
> > > >
> > > > munmap() -- ns/call
> > > > -------------------
> > > > size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
> > > > 4K     11008      8448          +2560  (+30%)     13568      11264
> > > > 16K    10752      8448          +2304  (+27%)     12288      9472
> > > > 64K    12032      9728          +2304  (+24%)     13824      10752
> > > > 256K   15360      12800         +2560  (+20%)     17152      14592
> > > > 1M     27648      24832         +2816  (+11%)     30976      26880
> > > >
> > > > Since this is a clear security-vs-performance balance, I picked security
> > > > over performance. Please let me know if you prefer otherwise.
> > >
> > > So, I still believe that (almost) zero-overhead is possible in the
> > > current framework. Maybe one could add a non-zero overhead option, but
> > > then I would like to be able to disable it as there are plenty of
> > > scenarios where security is of no concern.
> >
> > This is fair, I can add an option to keep it disabled by default.
>
> Sure. But I still think that a discussion is needed on whether another
> option is viable. I would really prefer not adding your proposed
> solution if another low/zero overhead option is viable.
>
> And I still absolutely believe that it is viable to modify the current
> code and make it secure. I just never finished that work because it
> requires a lot of care and it is not that important to us.

OK. Just to provide more context: I am evaluating UML for multi-tenant
cloud use, where security is a must rather than an option.

I guess I might be the first one doing so, otherwise these security
issues would have been resolved. :)

>
> > > Also, I am unsure what you measured there. When you write "mmap" here,
> > > is that an "mmap" call made by the userspace process (which might not
> > > actually change the page mappings), or is that an "mmap" call that was
> > > queued for the stub to process?
> > >
> > > A large amount of the overhead that UML tends to have happens when
> > > handling both minor and major page faults. A simple test for this is,
> > > for example, a relatively tight fork/exec loop or just applications
> > > starting up. In our test runs, we saw major speed improvements (>10%)
> > > just by fixing bugs in the TLB handling code that resulted in fewer
> > > page faults happening at application runtime.
> >
> > It is just a simple user-space mmap() loop for benchmarking this
> > patchset.
>
> If you just say "mmap() loop" I think that you are purely doing an
> mmap/unmap syscall of anonymous memory in a loop. However, the kernel
> may not map any of those pages until they are actually accessed. I
> would not have been surprised if you did not see any performance
> difference at all in such a test.
>
> Are you actually touching (writing) each of the mapped pages?

Yes, with MAP_POPULATE. Sorry for forgetting to mention it.
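
A loop of this kind can be sketched as follows. This is an illustration,
not the program that produced the numbers above: the mapping flags other
than MAP_POPULATE, the choice of CLOCK_MONOTONIC, and the helper names
(now_ns, cmp_u64, bench) are all assumptions.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* qsort() comparator for uint64_t samples. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Time `iters` mmap()/munmap() pairs of `len` bytes and store the median
 * latency of each call in ns. MAP_POPULATE pre-faults the pages, so the
 * mappings are really established rather than deferred to first access.
 * Returns 0 on success, -1 on failure.
 */
static int bench(size_t len, int iters, uint64_t *map_med, uint64_t *unmap_med)
{
	uint64_t *m = malloc((size_t)iters * sizeof(*m));
	uint64_t *u = malloc((size_t)iters * sizeof(*u));
	int ret = -1;

	if (!m || !u || iters <= 0)
		goto out;

	for (int i = 0; i < iters; i++) {
		uint64_t t0 = now_ns();
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
			       -1, 0);
		uint64_t t1 = now_ns();

		if (p == MAP_FAILED)
			goto out;
		munmap(p, len);
		m[i] = t1 - t0;
		u[i] = now_ns() - t1;
	}

	qsort(m, (size_t)iters, sizeof(*m), cmp_u64);
	qsort(u, (size_t)iters, sizeof(*u), cmp_u64);
	*map_med = m[iters / 2];
	*unmap_med = u[iters / 2];
	ret = 0;
out:
	free(m);
	free(u);
	return ret;
}
```

Calling bench() once per mapping size (4K through 1M) and printing the two
medians gives one row per size, in the same ns/call units as the tables.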

>
> > > > Meanwhile, after a second thought, there is a zero-overhead solution.
> > > > pidfd_mmap(), which can be built on top of another patch of mine:
> > > > https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/
> > >
> > > Sure, should some sort of method land that allows directly modifying
> > > the mappings of the remote process, that would be rather convenient.
> > >
> >
> > Great. I will work on it. One question: should I drop this one, or can we
> > use it as a short-term solution until the pidfd_mmap() solution gets in?
> > Even if we had pidfd_mmap(), UML might still want to keep compatibility with
> > older host kernels?
>
> Of course, UML should not regress on older kernels. Also, any new API
> like that will probably still take a while. That said, I am not sure
> there is a pressing need to fix the security problem as long as we have
> ptrace mode available.

Maybe I am greedy, but I want both security and performance. I ruled out
the ptrace mode because of its low performance and am trying to improve the
seccomp mode for better security.

Anyway, I agree to drop this patchset and work on pidfd_mmap() instead.
I will send the new patches out shortly.

>
> >
> > > > Something like:
> > > >
> > > > int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
> > > >         int phys_fd, unsigned long long offset)
> > > > {
> > > >         return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
> > > >                           prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
> > > >                           phys_fd, offset);
> > > > }
> > > > int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
> > > > {
> > > >         return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
> > > > }
> > > >
> > > > Obviously, this one requires a new syscall for the host kernel first.
> > > >
> > > > >
> > > > > >   6:   add a watchdog thread that detects a stub which stops reporting
> > > > > >        back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> > > > > >        recover via the existing teardown.
> > > > >
> > > > > That also seems like an odd solution to me. Architecturally, UML first
> > > > > receives the SIGALRM and forwards it to the child. It would seem much
> > > > > easier to set a flag and clear it again when the process reports back
> > > > > that it received the SIGALRM. Then, when the kernel receives the next
> > > > > SIGALRM, just kill the child immediately if the flag is still set.
> > > >
> > > > The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
> > > > the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
> > > > signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
> > > > child" is accurate there. This patch is on the seccomp path
> > > > (wait_stub_done_seccomp), which is architecturally different.
> > > >
> > > > Or am I missing anything?
> > >
> > > The fact that it is not actually different from an architectural
> > > standpoint. In both cases SIGALRM is reported back to the kernel. It
> > > obviously has to be reported back, otherwise scheduling would not work.
> >
> > You're right.
> >
> > In the seccomp path the monitor also receives the SIGALRM and
> > forwards it to the stub: the per-vCPU POSIX timer targets the monitor thread
> > and um_timer() -> os_alarm_process() -> kill(stub, SIGALRM) forwards
> > the tick. ptrace vs. seccomp differ only in how the monitor
> > learns the stub stopped (waitpid vs. futex), not in the point you raised.
> >
> > So the watchdog thread is redundant. The inner FUTEX_WAIT loop in
> > wait_stub_done_seccomp already wakes on every forwarded tick via EINTR;
> > a responsive stub acks by flipping data->futex out of FUTEX_IN_CHILD, while
> > a SIGALRM-blocking one doesn't, which is exactly your flag scheme. I'll drop
> > the helper thread and timerfd and detect the stall inline: count consecutive
> > ticks where the stub didn't report and goto out_kill past a small threshold.
>
> Not sure I follow and the description makes me wonder if it was written
> by an LLM. To me it seems overly complicated and incorrect in subtle
> ways.

Yes, I marked it with Assisted-by: Claude:claude-opus-4.8.
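
Stripped of the futex details, the flag scheme suggested earlier in the
thread (set a flag when the tick is forwarded, clear it when the stub
reports back, kill the stub if the flag is still set on the next tick) can
be sketched roughly as below. The struct, enum, and function names are
made up for illustration and are not existing UML code; where these hooks
would actually live is left open.

```c
#include <stdbool.h>

/* Hypothetical per-stub state; not an existing UML structure. */
struct stub_tick_state {
	bool tick_pending;	/* set on forward, cleared when the stub reports */
};

enum tick_action {
	TICK_FORWARD,		/* forward SIGALRM to the stub as usual */
	TICK_KILL,		/* stub ignored the previous tick: kill it */
};

/* Called when the monitor itself receives SIGALRM. */
static enum tick_action on_monitor_tick(struct stub_tick_state *s)
{
	if (s->tick_pending)
		return TICK_KILL;	/* previous tick was never acknowledged */

	s->tick_pending = true;
	return TICK_FORWARD;
}

/* Called when the stub reports back that it received SIGALRM. */
static void on_stub_report(struct stub_tick_state *s)
{
	s->tick_pending = false;
}
```

A stub that blocks SIGALRM never reaches on_stub_report(), so the second
tick returns TICK_KILL without any helper thread or counter.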

Thanks.