[PATCH v10 09/13] x86/um: nommu: signal handling

Fri Jun 27 06:50:41 PDT 2025

Hello,

Thanks for the comments on this complicated part of the kernel (signal
handling).

On Wed, 25 Jun 2025 08:20:03 +0900,
Benjamin Berg wrote:
> 
> Hi,
> 
> On Mon, 2025-06-23 at 06:33 +0900, Hajime Tazaki wrote:
> > This commit updates the behavior of signal handling under !MMU
> > environment. It adds the alignment code for signal frame as the frame
> > is used in userspace as-is.
> > 
> > floating point register is carefully handling upon entry/leave of
> > syscall routine so that signal handlers can read/write the contents of
> > the register.
> > 
> > It also adds the follow up routine for SIGSEGV as a signal delivery runs
> > in the same stack frame while we have to avoid endless SIGSEGV.
> > 
> > Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
> > ---
> >  arch/um/include/shared/kern_util.h    |   4 +
> >  arch/um/nommu/Makefile                |   2 +-
> >  arch/um/nommu/os-Linux/signal.c       |  13 ++
> >  arch/um/nommu/trap.c                  | 194 ++++++++++++++++++++++++++
> >  arch/x86/um/nommu/do_syscall_64.c     |   6 +
> >  arch/x86/um/nommu/os-Linux/mcontext.c |  11 ++
> >  arch/x86/um/shared/sysdep/mcontext.h  |   1 +
> >  arch/x86/um/shared/sysdep/ptrace.h    |   2 +-
> >  8 files changed, 231 insertions(+), 2 deletions(-)
> >  create mode 100644 arch/um/nommu/trap.c
> > 
> > [SNIP]
> > diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
> > index c4ef877d5ea0..955e7d9f4765 100644
> > --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> > +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> > @@ -6,6 +6,17 @@
> >  #include <sysdep/mcontext.h>
> >  #include <sysdep/syscalls.h>
> >  
> > +static void __userspace_relay_signal(void)
> > +{
> > + /* XXX: dummy syscall */
> > + __asm__ volatile("call *%0" : : "r"(__kernel_vsyscall), "a"(39) :);
> > +}
> 
> 39 is NR__getpid, I assume?
> 
> The "call *%0" looks like it is code for retpolin, I think this would
> currently just segfault.

# if by retpolin you mean zpoline:

zpoline uses `call *%rax`, so this is not about zpoline.

> > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > +{
> > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > +}
> > +

This is a somewhat scary piece of code that I wrote to handle the case
where the host raises SIGSEGV for a userspace program running on UML
(nommu).

# and I should remember that my XXX tags mark things that need fixing....

Let me try to explain what happens and what I tried to solve.

SIGSEGV from a userspace program is delivered to userspace, but if
nothing fixes the code that raised the signal, execution restarts from
the same $rip after (um) rt_sigreturn and raises SIGSEGV again.

# so, yes, we already rely on the host's and um's rt_sigreturn to
  restore various things.

When a uml userspace program crashes with SIGSEGV:

- host kernel raises SIGSEGV (at original $rip)
- caught by uml process (hard_handler)
- raise a signal to uml userspace process (segv_handler)
- handler ends (hard_handler)
- (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
  not (host) rt_sigaction)
- return back to the original $rip
- (back to top)

This is the case where the endless loop happens.
um's sa_handler isn't called because rt_sigreturn (um) isn't called,
and my original attempt (__userspace_relay_signal) was meant to fix that.

I agree that it is lazy to call a dummy syscall (indeed, getpid).
I'm trying to introduce another routine to jump into userspace and
call (um) rt_sigreturn after (host) rt_sigreturn.
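
For reference, the RIP rewrite that set_mc_userspace_relay_signal()
relies on can be tried in a standalone host program. This is only a
sketch under assumptions (x86_64 Linux with glibc); landing(),
on_segv() and redirect_fault() are hypothetical names, and the RSP
adjustment and the longjmp() are my additions for the demo, not
something the patch does. It also shows the concern raised in the
review: the routine entered this way has no return address, so it has
to leave by some other means.

```c
#define _GNU_SOURCE                 /* for REG_RIP / REG_RSP */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <ucontext.h>

static jmp_buf back;
static volatile sig_atomic_t landed;

/* Entered through a rewritten RIP, not a call: nothing to return to. */
static void landing(void)
{
	landed = 1;
	longjmp(back, 1);           /* the only safe way out of here */
}

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;
	greg_t sp = uc->uc_mcontext.gregs[REG_RSP];

	(void)sig;
	(void)si;
	/*
	 * Skip the 128-byte red zone and mimic the stack alignment at
	 * function entry (rsp % 16 == 8), since no call pushed a
	 * return address for us.
	 */
	sp = ((sp - 128) & ~(greg_t)15) - 8;
	uc->uc_mcontext.gregs[REG_RSP] = sp;
	/* the host rt_sigreturn will now resume in landing() */
	uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(unsigned long)landing;
}

/* Returns 1 if the fault was redirected into landing(). */
int redirect_fault(void)
{
	struct sigaction sa, old;
	int *volatile bad = NULL;
	int rc;

	landed = 0;
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	rc = sigaction(SIGSEGV, &sa, &old);
	assert(rc == 0);
	(void)rc;

	if (setjmp(back) == 0)
		*bad = 1;           /* fault once, resume in landing() */

	sigaction(SIGSEGV, &old, NULL);
	return landed;
}
```

Note that the interrupted RIP is lost unless the handler saves it
somewhere first, which matches the mcontext concern in the review.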

> And this is really confusing me. The way I am reading it, the code
> tries to do:
>    1. Rewrite RIP to jump to __userspace_relay_signal
>    2. Trigger a getpid syscall (to do "nothing"?)
>    3. Let do_syscall_64 fire the signal from interrupt_end

correct.

> However, then that really confuses me, because:
>  * If I am reading it correctly, then this approach will destroy the
>    contents of various registers (RIP, RAX and likely more)
>  * This would result in an incorrect mcontext in the userspace signal
>    handler (which could be relevant if userspace is inspecting it)
>  * However, worst, rt_sigreturn will eventually jump back
>    into__userspace_relay_signal, which has nothing to return to.
>  * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
>    is userspace interrupted immediately in that case?

relay_signal shares the same goal as this, indeed,
but I guess the issue with `mc->gregs[REG_RIP]` (endless signals) still
exists.

> Honestly, I really think we should take a step back and swap the
> current syscall entry/exit code. That would likely also simplify
> floating point register handling, which I think is currently
> insufficient do deal with the odd special cases caused by different
> x86_64 hardware extensions.
> 
> Basically, I think nommu mode should use the same general approach as
> the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> userspace and let the host kernel deal with the ugly details of how to
> do that.

I looked at how MMU mode (ptrace/seccomp) handles this case.

In nommu mode, we don't have an external process to catch signals, so
nommu mode uses hard_handler() to catch SEGV/FPE of userspace
programs, while mmu mode calls segv_handler outside of a signal handler
context.

# correct me if I'm wrong.

thus, mmu mode doesn't have this situation.

I'm trying various approaches, e.g. calling um's rt_sigreturn instead
of the host's one, but that doesn't work because the host's restore
procedures (unblocking masked signals, restoring register state, etc.)
aren't run.

I'll update here if I find a good direction, but it would be great if
you could suggest how this should be handled.

-- Hajime

> I believe that this requires a second "userspace" sigaltstack in
> addition to the current "IRQ" sigaltstack. Then switching in between
> the two (note that the "userspace" one is also used for IRQs if those
> happen while userspace is executing).
> 
> So, in principle I would think something like:
>  * to jump into userspace, you would:
>     - block all signals
>     - set "userspace" sigaltstack
>     - setup mcontext for rt_sigreturn
>     - setup RSP for rt_sigreturn
>     - call rt_sigreturn syscall
>  * all signal handlers can (except pure IRQs):
>     - check on which stack they are
>       -> easy to detect whether we are in kernel mode
>     - for IRQs one can probably handle them directly (and return)
>     - in user mode:
>        + store mcontext location and information needed for rt_sigreturn
>        + jump back into kernel task stack
>  * kernel task handler to continue would:
>     - set sigaltstack to IRQ stack
>     - fetch register from mcontext
>     - unblock all signals
>     - handle syscall/signal in whatever way needed
> 
> Now that I wrote about it, I am thinking that it might be possible to
> just use the kernel task stack for the signal stack. One would probably
> need to increase the kernel stack size a bit, but it would also mean
> that no special code is needed for "rt_sigreturn" handling. The rest
> would remain the same.
> 
> Thoughts?
> 
> Benjamin
> 
> > [SNIP]
>