[PATCH v10 09/13] x86/um: nommu: signal handling

Fri Jun 27 08:02:05 PDT 2025

Hi,

On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> thanks for the comment on the complicated part of the kernel (signal).

This stuff isn't simple.

Actually, I am starting to think that the current MMU UML kernel also
needs a redesign with regard to signal handling and stack use in that
case. My current impression is that the design right now only permits
voluntary scheduling. More specifically, scheduling in response to an
interrupt is impossible.

I suppose that works fine, but it also does not seem quite right.

> On Wed, 25 Jun 2025 08:20:03 +0900,
> Benjamin Berg wrote:
> > 
> > Hi,
> > 
> > On Mon, 2025-06-23 at 06:33 +0900, Hajime Tazaki wrote:
> > > This commit updates the behavior of signal handling under !MMU
> > > environment. It adds the alignment code for signal frame as the frame
> > > is used in userspace as-is.
> > > 
> > > > Floating point registers are carefully handled upon entry to and exit
> > > > from the syscall routine so that signal handlers can read/write the
> > > > contents of the registers.
> > > 
> > > > It also adds a follow-up routine for SIGSEGV: signal delivery runs in
> > > > the same stack frame, so we have to avoid an endless SIGSEGV loop.
> > > 
> > > Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
> > > ---
> > >  arch/um/include/shared/kern_util.h    |   4 +
> > >  arch/um/nommu/Makefile                |   2 +-
> > >  arch/um/nommu/os-Linux/signal.c       |  13 ++
> > >  arch/um/nommu/trap.c                  | 194 ++++++++++++++++++++++++++
> > >  arch/x86/um/nommu/do_syscall_64.c     |   6 +
> > >  arch/x86/um/nommu/os-Linux/mcontext.c |  11 ++
> > >  arch/x86/um/shared/sysdep/mcontext.h  |   1 +
> > >  arch/x86/um/shared/sysdep/ptrace.h    |   2 +-
> > >  8 files changed, 231 insertions(+), 2 deletions(-)
> > >  create mode 100644 arch/um/nommu/trap.c
> > > 
> > > [SNIP]
> > > diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > index c4ef877d5ea0..955e7d9f4765 100644
> > > --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> > > +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > @@ -6,6 +6,17 @@
> > >  #include <sysdep/mcontext.h>
> > >  #include <sysdep/syscalls.h>
> > >  
> > > +static void __userspace_relay_signal(void)
> > > +{
> > > + /* XXX: dummy syscall */
> > > + __asm__ volatile("call *%0" : : "r"(__kernel_vsyscall), "a"(39) :);
> > > +}
> > 
> > > 39 is __NR_getpid, I assume?
> > 
> > The "call *%0" looks like it is code for retpolin, I think this would
> > currently just segfault.
> 
> # if you mean retpolin as zpoline,
> 
> > zpoline uses `call *%rax`, so this is not about zpoline.

Ah, yes, of course.

> > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > +{
> > > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > +}
> > > +
> 
> This is a bit scary code which I tried to handle when SIGSEGV is
> raised by host for a userspace program running on UML (nommu).
> 
> # and I should remember my XXX tag is important to fix....
> 
> let me try to explain what happens and what I tried to solve.
> 
> The SEGV signal from userspace program is delivered to userspace but
> if we don't fix the code raising the signal, after (um) rt_sigreturn,
> it will restart from $rip and raise SIGSEGV again.
> 
> # so, yes, we've already relied on host and um's rt_sigreturn to
>   restore various things.
> 
> when a uml userspace crashes with SIGSEGV,
> 
> - host kernel raises SIGSEGV (at original $rip)
> - caught by uml process (hard_handler)
> - raise a signal to uml userspace process (segv_handler)
> - handler ends (hard_handler)
> - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
>   not (host) rt_sigaction)
> - return back to the original $rip
> - (back to top)
> 
> > this is the case where the endless loop happens.
> > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > and my original attempt (__userspace_relay_signal) is what I tried.
> 
> I agree that it is lazy to call a dummy syscall (indeed, getpid).
> I'm trying to introduce another routine to jump into userspace and
> call (um) rt_sigreturn after (host) rt_sigreturn.
> 
> > And this is really confusing me. The way I am reading it, the code
> > tries to do:
> >    1. Rewrite RIP to jump to __userspace_relay_signal
> >    2. Trigger a getpid syscall (to do "nothing"?)
> >    3. Let do_syscall_64 fire the signal from interrupt_end
> 
> correct.
> 
> > However, then that really confuses me, because:
> >  * If I am reading it correctly, then this approach will destroy the
> >    contents of various registers (RIP, RAX and likely more)
> >  * This would result in an incorrect mcontext in the userspace signal
> >    handler (which could be relevant if userspace is inspecting it)
> > >  * Worst of all, rt_sigreturn will eventually jump back into
> > >    __userspace_relay_signal, which has nothing to return to.
> >  * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> >    is userspace interrupted immediately in that case?
> 
> relay_signal shares the same goal of this, indeed.
> but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> I guess.

Well, endless signals only exist as long as you exit to the same
location. My suggestion was to read the user state from the mcontext
(as SECCOMP mode does) and deliver the signal right away, i.e.:
 * Fetch the current registers from the mcontext
 * Push the signal context onto the userspace stack
 * Modify the host mcontext to set registers for the signal handler
 * Jump back to userspace by doing a "return"
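
A minimal sketch of these four steps, assuming x86_64 Linux with glibc.
It is not UML code: guest_handler(), redirecting_handler() and
run_redirect_demo() are hypothetical names, the "guest handler" is an
ordinary function in the same process, and no real signal frame is
copied to the stack. It only shows that editing the host mcontext and
returning is enough to avoid re-entering the faulting instruction.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes REG_RIP and friends */
#endif
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

#if !defined(__linux__) || !defined(__x86_64__)
#error "this sketch assumes x86_64 Linux"
#endif

static sigjmp_buf demo_exit;
static volatile sig_atomic_t fault_count;
static volatile sig_atomic_t guest_signo;

/* Hypothetical stand-in for the guest (UML userspace) signal handler. */
static void guest_handler(int signo)
{
	guest_signo = signo;
	siglongjmp(demo_exit, 1);	/* the demo never resumes the faulting code */
}

/* Host-side handler: redirect the interrupted context instead of resuming it. */
static void redirecting_handler(int signo, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;
	greg_t *gregs = uc->uc_mcontext.gregs;
	uintptr_t sp;

	(void)si;
	if (++fault_count > 1)
		_exit(2);	/* safety net: we came back to the faulting RIP */

	/* 1. Fetch the interrupted state from the host mcontext. */
	sp = (uintptr_t)gregs[REG_RSP];

	/*
	 * 2. Reserve room below the interrupted stack pointer. Real code
	 *    would copy the guest signal frame here; the sketch only skips
	 *    the red zone and fakes the return-address slot so that
	 *    rsp % 16 == 8, as at a normal function entry.
	 */
	sp = ((sp - 128) & ~(uintptr_t)15) - 8;

	/* 3. Point the host mcontext at the guest handler. */
	gregs[REG_RSP] = (greg_t)sp;
	gregs[REG_RIP] = (greg_t)(uintptr_t)guest_handler;
	gregs[REG_RDI] = signo;		/* first argument of guest_handler() */

	/* 4. Plain return: the host's rt_sigreturn loads the edited context. */
}

int run_redirect_demo(void)
{
	struct sigaction sa, old_sa;
	void *page;

	/* A page without access rights gives a well-defined fault. */
	page = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = redirecting_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old_sa) != 0)
		return -1;

	fault_count = 0;
	if (sigsetjmp(demo_exit, 1) == 0)
		*(volatile char *)page = 1;	/* raises SIGSEGV */

	sigaction(SIGSEGV, &old_sa, NULL);
	munmap(page, 4096);
	return guest_signo;
}
```

Calling run_redirect_demo() from main() on a matching system is expected
to return SIGSEGV with fault_count left at 1: the host handler ran once
and the process did not loop back to the faulting store.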

Said differently, I really prefer deferring as much logic as possible
to the host. This is both safer and easier to understand. Plus, it also
has the advantage of making it simpler to port UML to other
architectures.

> > Honestly, I really think we should take a step back and swap the
> > current syscall entry/exit code. That would likely also simplify
> > floating point register handling, which I think is currently
> > > insufficient to deal with the odd special cases caused by different
> > x86_64 hardware extensions.
> > 
> > Basically, I think nommu mode should use the same general approach as
> > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > userspace and let the host kernel deal with the ugly details of how to
> > do that.
> 
> I looked at how MMU mode (ptrace/seccomp) does handle this case.
> 
> > In nommu mode, we don't have an external process to catch signals, so
> > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > programs, while mmu mode calls segv_handler outside of a signal
> > handler context.
> 
> # correct me if I'm wrong.
> 
> thus, mmu mode doesn't have this situation.

Yes, it does not have this specific issue. But see the top of the mail
for other issues that are somewhat related.

> I'm attempting various ways; calling um's rt_sigreturn instead of
> host's one, which doesn't work as host restore procedures (unblocking
> masked signals, restoring register states, etc) aren't called.
> 
> > I'll update here if I find a good direction, but it would be great if
> you see how it should be handled.

Can we please discuss possible solutions? We can figure out the details
once it is clear how the interaction with the host should work.

I still think that the idea of using the kernel task stack as the
signal stack is really elegant. Actually, doing that in normal UML may
be how we can fix the issues mentioned at the top of my mail. And for
nommu, we can also use the host mcontext to jump back into userspace
using a simple "return".

Conceptually it seems so simple.
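
A rough sketch of the stack part of that idea, assuming plain POSIX
signals on Linux. A malloc()ed buffer stands in for the kernel task
stack, and task_stack, altstack_handler() and run_altstack_demo() are
hypothetical names, not existing UML symbols. The handler checks which
stack it runs on by comparing the address of a local variable against
the registered range.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TASK_STACK_SIZE (64 * 1024)	/* assumed size, generous for the demo */

static unsigned char *task_stack;	/* stand-in for the kernel task stack */
static volatile sig_atomic_t ran_on_task_stack = -1;

/* Does address p lie inside [base, base + size)? */
static int addr_on_stack(const void *p, const unsigned char *base, size_t size)
{
	uintptr_t a = (uintptr_t)p, b = (uintptr_t)base;

	return a >= b && a < b + size;
}

static void altstack_handler(int signo)
{
	int marker = 0;	/* lives on whichever stack the handler runs on */

	(void)signo;
	ran_on_task_stack = addr_on_stack(&marker, task_stack, TASK_STACK_SIZE);
}

int run_altstack_demo(void)
{
	struct sigaction sa, old_sa;
	stack_t ss, old_ss;

	task_stack = malloc(TASK_STACK_SIZE);
	if (!task_stack)
		return -1;

	/* Register the buffer as the signal stack for this thread. */
	memset(&ss, 0, sizeof(ss));
	ss.ss_sp = task_stack;
	ss.ss_size = TASK_STACK_SIZE;
	if (sigaltstack(&ss, &old_ss) != 0)
		return -1;

	/* SA_ONSTACK asks the host to deliver the signal on that stack. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = altstack_handler;
	sa.sa_flags = SA_ONSTACK;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR2, &sa, &old_sa) != 0)
		return -1;

	raise(SIGUSR2);		/* handler runs before raise() returns */

	/* Put the previous handler and signal stack back. */
	sigaction(SIGUSR2, &old_sa, NULL);
	sigaltstack(&old_ss, NULL);
	free(task_stack);
	return ran_on_task_stack;
}
```

run_altstack_demo() is expected to return 1 when the handler ran on the
registered stack. Querying sigaltstack(NULL, &cur) and testing
cur.ss_flags & SS_ONSTACK would be an alternative to the address check.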

Benjamin

> 
> -- Hajime
> 
> > I believe that this requires a second "userspace" sigaltstack in
> > addition to the current "IRQ" sigaltstack. Then switching in between
> > the two (note that the "userspace" one is also used for IRQs if those
> > happen while userspace is executing).
> > 
> > So, in principle I would think something like:
> >  * to jump into userspace, you would:
> >     - block all signals
> >     - set "userspace" sigaltstack
> >     - setup mcontext for rt_sigreturn
> >     - setup RSP for rt_sigreturn
> >     - call rt_sigreturn syscall
> >  * all signal handlers can (except pure IRQs):
> >     - check on which stack they are
> >       -> easy to detect whether we are in kernel mode
> >     - for IRQs one can probably handle them directly (and return)
> >     - in user mode:
> >        + store mcontext location and information needed for rt_sigreturn
> >        + jump back into kernel task stack
> >  * kernel task handler to continue would:
> >     - set sigaltstack to IRQ stack
> >     - fetch register from mcontext
> >     - unblock all signals
> >     - handle syscall/signal in whatever way needed
> > 
> > Now that I wrote about it, I am thinking that it might be possible to
> > just use the kernel task stack for the signal stack. One would probably
> > need to increase the kernel stack size a bit, but it would also mean
> > that no special code is needed for "rt_sigreturn" handling. The rest
> > would remain the same.
> > 
> > Thoughts?
> > 
> > Benjamin
> > 
> > > [SNIP]
> > 
>