[PATCH 0/7] um: skas: harden the seccomp userspace stub

Sat Jun 27 09:21:41 PDT 2026

Hi,

On Wed, 2026-06-24 at 16:12 -0700, Cong Wang wrote:
> On Wed, Jun 24, 2026 at 12:54 AM Benjamin Berg
> <benjamin at sipsolutions.net> wrote:
> > 
> > On Tue, 2026-06-23 at 15:08 -0700, Cong Wang wrote:
> > > On Mon, Jun 22, 2026 at 5:08 AM Benjamin Berg <benjamin at sipsolutions.net> wrote:
> > > > 
> > > > On Fri, 2026-06-19 at 20:22 -0700, Cong Wang wrote:
> > > > > From: Cong Wang <cwang at multikernel.io>
> > > > > 
> > > > > In the seccomp ("SECCOMP") userspace mode, each guest userspace process
> > > > > runs in a stub under a seccomp filter and traps to the monitor (the UML
> > > > > kernel) on every syscall. Two items on the stub.c "Known security issues"
> > > > > list could not be addressed by the filter alone:
> > > > > 
> > > > >   - a hijacked stub could mmap() arbitrary physmem offsets, which is an
> > > > >     intra-guest disclosure and, on this base (single physmem fd, no
> > > > >     kernel/user split), a host escape; and
> > > > > 
> > > > >   - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
> > > > >     evade preemption and wedge the monitor indefinitely.
> > > > > 
> > > > > This series closes both:
> > > > > 
> > > > >   1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
> > > > >        owned by the monitor (no behavioural change yet).
> > > > >   3-4: validate each mmap() against the mm's page table -- allowed iff the
> > > > >        PTE already maps the requested frame with no more access than it
> > > > >        grants -- including out-of-batch mmaps a hijacked stub issues on
> > > > >        its own.
> > > > >   5:   route and validate munmap() the same way (range-confined below
> > > > >        STUB_START).
> > > > 
> > > > 
> > > > That approach seems odd to me. Adding an explicit out-of-band check
> > > > means you require two extra context switches per mmap syscall. I would
> > > > expect that this makes the SECCOMP approach a lot slower than ptrace().
> > > > My take is still that it is possible to carefully craft a SECCOMP
> > > > filter as well as stub/kernel code that makes exploitation impossible
> > > > for non-SMP.
> > > > 
> > > > The true SMP case is more complicated, but we do not have that anyway,
> > > > so I would not worry about it for now.
> > > > 
> > > > Did you run any performance tests?
> > > 
> > > Sorry, I thought I included them in patch 4/7, but I missed them.
> > > 
> > > Here they are:
> > > 
> > > mmap() -- ns/call
> > > -----------------
> > > size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
> > > 4K     13056      10368         +2688  (+26%)     17408      15104
> > > 16K    13824      11008         +2816  (+26%)     15616      12544
> > > 64K    19840      15616         +4224  (+27%)     21760      17408
> > > 256K   33152      28928         +4224  (+15%)     36352      31744
> > > 1M     95616      84608         +11008 (+13%)     117504     108032
> > > 
> > > munmap() -- ns/call
> > > -------------------
> > > size   WITH med   WITHOUT med   delta            WITH p99   WITHOUT p99
> > > 4K     11008      8448          +2560  (+30%)     13568      11264
> > > 16K    10752      8448          +2304  (+27%)     12288      9472
> > > 64K    12032      9728          +2304  (+24%)     13824      10752
> > > 256K   15360      12800         +2560  (+20%)     17152      14592
> > > 1M     27648      24832         +2816  (+11%)     30976      26880
> > > 
> > > Since this is a clear security-vs-performance balance, I picked security
> > > over performance. Please let me know if you prefer otherwise.
> > 
> > So, I still believe that (almost) zero-overhead is possible in the
> > current framework. Maybe one could add a non-zero overhead option, but
> > then I would like to be able to disable it as there are plenty of
> > scenarios where security is of no concern.
> 
> This is fair, I can add an option to keep it disabled by default.

Sure. But I still think that a discussion is needed on whether another
option is viable. I would really prefer not adding your proposed
solution if another low/zero overhead option is viable.

And I still absolutely believe that it is viable to modify the current
code and make it secure. I just never finished that work because it
requires a lot of care and it is not that important to us.

> > Also, I am unsure what you measured there. When you write "mmap" here,
> > is that an "mmap" call made by the userspace process (which might not
> > actually change the page mappings), or is that an "mmap" call that was
> > queued for the stub to process?
> > 
> > A large amount of the overhead that UML tends to have happens when
> > handling both minor and major page faults. Simple tests for this are, for
> > example, a relatively tight fork/exec loop or just applications starting
> > up. In our test runs, we saw major speed improvements (>10%) just by
> > fixing bugs in the TLB handling code that resulted in fewer page faults
> > happening at application runtime.
> 
> It is a simple user-space mmap() loop, written just for benchmarking this
> patchset.

If you just say "mmap() loop", I assume that you are purely doing
mmap()/munmap() syscalls on anonymous memory in a loop. However, the kernel
may not map any of those pages until they are actually accessed. I
would not have been surprised if you did not see any performance
difference at all in such a test.

Are you actually touching (writing) each of the mapped pages?
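
To illustrate what I mean by touching the pages, here is a minimal sketch,
assuming a plain anonymous private mapping. The name touch_bench_ns() is
made up for this example, and this is not a claim about what your benchmark
currently does:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Map `len` bytes of anonymous memory, write one byte to every page so
 * that each page is actually faulted in, then unmap the range again.
 *
 * Returns the elapsed time in nanoseconds, or -1 if the page size cannot
 * be determined or mmap()/munmap() fails.
 * (Hypothetical helper, for illustration only.)
 */
static int64_t touch_bench_ns(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	volatile unsigned char *p;
	size_t off;

	if (page <= 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* One write per page forces a page fault for every page. */
	for (off = 0; off < len; off += (size_t)page)
		p[off] = 1;

	if (munmap((void *)p, len) != 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (int64_t)(t1.tv_nsec - t0.tv_nsec);
}
```

Calling this in a loop for each size (4K up to 1M) and comparing it against
the same loop without the per-page write should show how much of the cost
comes from the syscalls and how much from the page faults.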

> > > Meanwhile, after a second thought, there is a zero-overhead solution.
> > > pidfd_mmap(), which can be built on top of another patch of mine:
> > > https://lore.kernel.org/all/20260613001533.314739-2-xiyou.wangcong@gmail.com/
> > 
> > Sure, should some sort of method land that allows directly modifying
> > the mappings of the remote process, that would be rather convenient.
> > 
> 
> Great. I will work on it. One question: should I drop this one, or can we use
> this one as the short-term solution before the pidfd_mmap() solution gets in?
> Even if we had pidfd_mmap(), UML might still want to keep compatibility with
> older host kernels?

Of course, UML should not regress on older kernels. Also, any new API
like that will probably still take a while. That said, I am not sure
there is a pressing need to fix the security problem as long as we have
ptrace mode available.

> 
> > > Something like:
> > > 
> > > int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
> > >         int phys_fd, unsigned long long offset)
> > > {
> > >         return pidfd_mmap(mm_idp->stub_pidfd, (void *)virt, len,
> > >                           prot_to_mmap(prot), MAP_FIXED | MAP_SHARED,
> > >                           phys_fd, offset);
> > > }
> > > int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
> > > {
> > >         return pidfd_munmap(mm_idp->stub_pidfd, (void *)addr, len);
> > > }
> > > 
> > > Obviously, this one requires a new syscall for the host kernel first.
> > > 
> > > > 
> > > > >   6:   add a watchdog thread that detects a stub which stops reporting
> > > > >        back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
> > > > >        recover via the existing teardown.
> > > > 
> > > > That also seems like an odd solution to me. Architecturally, UML first
> > > > receives the SIGALRM and forwards it to the child. It would seem much
> > > > easier to set a flag and clear it again when the process reports back
> > > > that it received the SIGALRM. Then, when the kernel receives the next
> > > > SIGALRM, just kill the child immediately if the flag is still set.
> > > 
> > > The flag-and-recheck scheme matches the ptrace path (wait_stub_done), where
> > > the monitor is the tracer: it sees the stub's SIGALRM as a waitpid()
> > > signal-stop and PTRACE_CONTs it, so "monitor receives, then deals with the
> > > child" is accurate there. This patch is on the seccomp path
> > > (wait_stub_done_seccomp), which is architecturally different.
> > > 
> > > Or am I missing anything?
> > 
> > The fact that it is not actually different from an architectural
> > standpoint. In both cases SIGALRM is reported back to the kernel. It
> > obviously has to be reported back, otherwise scheduling would not work.
> 
> You're right.
> 
> In the seccomp path the monitor also receives the SIGALRM and
> forwards it to the stub: the per-vCPU POSIX timer targets the monitor thread
> and um_timer() -> os_alarm_process() -> kill(stub, SIGALRM) forwards
> the tick. ptrace vs. seccomp differ only in how the monitor
> learns the stub stopped (waitpid vs. futex), not in the point you raised.
> 
> So the watchdog thread is redundant. The inner FUTEX_WAIT loop in
> wait_stub_done_seccomp already wakes on every forwarded tick via EINTR;
> a responsive stub acks by flipping data->futex out of FUTEX_IN_CHILD, a
> SIGALRM-blocking one doesn't, which is exactly your flag scheme. I'll drop the helper
> thread and timerfd and detect the stall inline: count consecutive ticks where
> the stub didn't report and goto out_kill past a small threshold.

Not sure I follow and the description makes me wonder if it was written
by an LLM. To me it seems overly complicated and incorrect in subtle
ways.
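
For reference, the flag scheme I described earlier amounts to roughly the
following. This is only a sketch: struct stub_tick_state and both function
names are made up for illustration and are not existing UML code, and the
actual signal forwarding and kill are left to the caller.

```c
#include <stdbool.h>

/* Hypothetical per-stub state; not the actual UML data structure. */
struct stub_tick_state {
	bool alarm_pending;	/* SIGALRM forwarded, not yet reported back */
};

/*
 * Called when the kernel receives a SIGALRM.
 *
 * Returns true if the child must be killed because it never reported
 * the previously forwarded SIGALRM. Otherwise the new tick is marked
 * as pending and false is returned; the caller then forwards SIGALRM.
 */
static bool tick_should_kill(struct stub_tick_state *s)
{
	if (s->alarm_pending)
		return true;

	s->alarm_pending = true;
	return false;
}

/* Called when the child reports back that it received the SIGALRM. */
static void tick_acked(struct stub_tick_state *s)
{
	s->alarm_pending = false;
}
```

There is no counter and no threshold here: a single tick that was not
reported back by the time the next one arrives is enough to kill the child.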

Benjamin