[PATCH v10 09/13] x86/um: nommu: signal handling

Tue Jun 24 16:20:03 PDT 2025

Hi,

On Mon, 2025-06-23 at 06:33 +0900, Hajime Tazaki wrote:
> This commit updates the behavior of signal handling under !MMU
> environment. It adds the alignment code for signal frame as the frame
> is used in userspace as-is.
> 
> floating point register is carefully handling upon entry/leave of
> syscall routine so that signal handlers can read/write the contents of
> the register.
> 
> It also adds the follow up routine for SIGSEGV as a signal delivery runs
> in the same stack frame while we have to avoid endless SIGSEGV.
> 
> Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
> ---
>  arch/um/include/shared/kern_util.h    |   4 +
>  arch/um/nommu/Makefile                |   2 +-
>  arch/um/nommu/os-Linux/signal.c       |  13 ++
>  arch/um/nommu/trap.c                  | 194 ++++++++++++++++++++++++++
>  arch/x86/um/nommu/do_syscall_64.c     |   6 +
>  arch/x86/um/nommu/os-Linux/mcontext.c |  11 ++
>  arch/x86/um/shared/sysdep/mcontext.h  |   1 +
>  arch/x86/um/shared/sysdep/ptrace.h    |   2 +-
>  8 files changed, 231 insertions(+), 2 deletions(-)
>  create mode 100644 arch/um/nommu/trap.c
> 
> [SNIP]
> diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
> index c4ef877d5ea0..955e7d9f4765 100644
> --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> @@ -6,6 +6,17 @@
>  #include <sysdep/mcontext.h>
>  #include <sysdep/syscalls.h>
>  
> +static void __userspace_relay_signal(void)
> +{
> + /* XXX: dummy syscall */
> + __asm__ volatile("call *%0" : : "r"(__kernel_vsyscall), "a"(39) :);
> +}

39 is __NR_getpid, I assume?

The "call *%0" looks like it is code for retpoline; I think this would
currently just segfault.
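
For reference, the syscall number can be checked from a host program. This
is only a minimal sketch, assuming an x86_64 Linux host with glibc; the
helper names are made up for illustration and are not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: 1 if getpid is syscall 39 here, -1 if this is
 * not a 64-bit x86 build (other ABIs number syscalls differently). */
static int getpid_nr_is_39(void)
{
#if defined(__x86_64__) && !defined(__ILP32__)
	return SYS_getpid == 39;
#else
	return -1;
#endif
}

/* Hypothetical helper: the raw syscall and the libc wrapper should
 * report the same pid. */
static int raw_getpid_matches(void)
{
	return syscall(SYS_getpid) == (long)getpid();
}
```

Using SYS_getpid (or __NR_getpid) instead of the literal 39 in the asm
constraint would make the intent obvious.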

> +
> +void set_mc_userspace_relay_signal(mcontext_t *mc)
> +{
> + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> +}
> +

And this is really confusing me. The way I am reading it, the code
tries to do:
   1. Rewrite RIP to jump to __userspace_relay_signal
   2. Trigger a getpid syscall (to do "nothing"?)
   3. Let do_syscall_64 fire the signal from interrupt_end

If that is right, then I do not see how it can work, because:
 * If I am reading it correctly, then this approach will destroy the
   contents of various registers (RIP, RAX and likely more)
 * This would result in an incorrect mcontext in the userspace signal
   handler (which could be relevant if userspace is inspecting it)
 * Worst of all, rt_sigreturn will eventually jump back into
   __userspace_relay_signal, which has nothing to return to.
 * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
   is userspace interrupted immediately in that case?
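
To illustrate why the saved register contents matter: a userspace handler
installed with SA_SIGINFO can read the interrupted RIP straight out of the
mcontext, so a rewritten RIP is visible to it. A minimal sketch, assuming
an x86_64 Linux host with glibc; inspect_handler and demo_saved_rip are
hypothetical names, not code from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <ucontext.h>

/* Written by the handler, read after raise() returns. */
static volatile unsigned long saved_rip;

static void inspect_handler(int sig, siginfo_t *info, void *ctx)
{
	ucontext_t *uc = ctx;

	(void)sig;
	(void)info;
#if defined(__x86_64__) && defined(__linux__)
	/* The instruction pointer the signal frame will return to. */
	saved_rip = (unsigned long)uc->uc_mcontext.gregs[REG_RIP];
#else
	(void)uc; /* other architectures name this register differently */
#endif
}

/* Returns the RIP recorded by the handler, or 0 on failure. */
static unsigned long demo_saved_rip(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = inspect_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR2, &sa, NULL) != 0)
		return 0;

	/* In a single-threaded program the handler runs before raise()
	 * returns. */
	raise(SIGUSR2);
	return saved_rip;
}
```

A handler like this would see the address of __userspace_relay_signal
rather than the instruction that was actually interrupted.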

Honestly, I really think we should take a step back and swap the
current syscall entry/exit code. That would likely also simplify
floating point register handling, which I think is currently
insufficient to deal with the odd special cases caused by different
x86_64 hardware extensions.

Basically, I think nommu mode should use the same general approach as
the current SECCOMP mode, which is to use rt_sigreturn to jump into
userspace and let the host kernel deal with the ugly details of how to
do that.

I believe that this requires a second "userspace" sigaltstack in
addition to the current "IRQ" sigaltstack, and then switching between
the two (note that the "userspace" one is also used for IRQs if those
happen while userspace is executing).

So, in principle I would think something like:
 * to jump into userspace, you would:
    - block all signals
    - set "userspace" sigaltstack
    - setup mcontext for rt_sigreturn
    - setup RSP for rt_sigreturn
    - call rt_sigreturn syscall
 * all signal handlers can (except pure IRQs):
    - check on which stack they are
      -> easy to detect whether we are in kernel mode
    - for IRQs one can probably handle them directly (and return)
    - in user mode:
       + store mcontext location and information needed for rt_sigreturn
       + jump back into kernel task stack
 * kernel task handler to continue would:
    - set sigaltstack to IRQ stack
    - fetch registers from mcontext
    - unblock all signals
    - handle syscall/signal in whatever way needed

Now that I have written this down, I am thinking that it might be possible to
just use the kernel task stack for the signal stack. One would probably
need to increase the kernel stack size a bit, but it would also mean
that no special code is needed for "rt_sigreturn" handling. The rest
would remain the same.

Thoughts?

Benjamin

> [SNIP]