[PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
Björn Töpel
bjorn at kernel.org
Mon Oct 21 07:00:18 PDT 2024
Celeste!
I'll pick up your thread [1] here, since there's code here! :-) Let's
try to clear up/define the possible flow.
The common/generic entry syscall_enter_from_user_mode{,_work}() says
that a return value of -1 means that the syscall should be skipped.
The RISC-V calling convention uses the same register for arg0/retval
(a0). So, when a ptracer (PTRACE_SYSCALL+PTRACE_GETREGS). That means
that the kernel cannot call into the tracer, *after* changing a0.
The interesting flow for a syscall is roughly:
1. The exception/trap function
2. syscall_enter_from_user_mode() which might return -1, meaning that
the syscall should be skipped. A tracer might have altered the
regs. More importantly, if it's -1 the kernel cannot change the
return value, because seccomp filtering might already done that.
3. If it's a valid syscall, perform the syscall.
Now some scenarios:
* A user does a valid syscall.
* A user does an invalid syscall(-1)
* A user does an invalid syscall(!= -1)
Then there're the tracing variants of those, and that's where we go get
trouble.
Now the bugs we've seen in RISC-V:
1. strace "tracing": Requires that regs->a0 is not tampered with prior
ptrace notification
E.g.:
| # ./strace /
| execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
| ./strace: exec: Permission denied
| +++ exited with 1 +++
| # ./disable_ptrace_get_syscall_info ./strace /
| execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
| ./strace: exec: Permission denied
| +++ exited with 1 +++
In the second case, arg0 is prematurely set to -ENOSYS
(0xffffffffffffffda).
2. strace "syscall tampering": Requires that ENOSYS is returned for
syscall(-1), and not skipped w/o a proper return value.
E.g.:
| ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello
Here, strace expects that injecting -1, would result in a ENOSYS.
3. seccomp filtering: Requires that the a0 is not tampered to
First 3 was broken (tampering with a0 after seccomp), then 2 was
broken (not setting ENOSYS for -1), and finally 1 was broken (and
still is!).
Now for your patch:
Celeste Liu via B4 Relay <devnull+CoelacanthusHex.gmail.com at kernel.org>
writes:
> From: Celeste Liu <CoelacanthusHex at gmail.com>
>
> The return value of syscall_enter_from_user_mode() is always -1 when the
> syscall was filtered. We can't know whether syscall_nr is -1 when we get -1
> from syscall_enter_from_user_mode(). And the old syscall variable is
> unusable because syscall_enter_from_user_mode() may change a7 register.
> So get correct syscall number from syscall_get_nr().
>
> So syscall number part of return value of syscall_enter_from_user_mode()
> is completely useless. We can remove it from API and require caller to
> get syscall number from syscall_get_nr(). But this change affect more
> architectures and will block more time. So we split it into another
> patchset to avoid block this fix. (Other architectures can works
> without this change but riscv need it, see Link: tag below)
>
> Fixes: 61119394631f ("riscv: entry: always initialize regs->a0 to -ENOSYS")
> Reported-by: Andrea Bolognani <abologna at redhat.com>
> Closes: https://github.com/strace/strace/issues/315
> Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
> Signed-off-by: Celeste Liu <CoelacanthusHex at gmail.com>
> ---
> arch/riscv/kernel/traps.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 51ebfd23e0076447518081d137102a9a11ff2e45..3125fab8ee4af468ace9f692dd34e1797555cce3 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -316,18 +316,25 @@ void do_trap_ecall_u(struct pt_regs *regs)
> {
> if (user_mode(regs)) {
> long syscall = regs->a7;
> + long res;
>
> regs->epc += 4;
> regs->orig_a0 = regs->a0;
> - regs->a0 = -ENOSYS;
>
> riscv_v_vstate_discard(regs);
>
> - syscall = syscall_enter_from_user_mode(regs, syscall);
> + res = syscall_enter_from_user_mode(regs, syscall);
> + /*
> + * Call syscall_get_nr() again because syscall_enter_from_user_mode()
> + * may change a7 register.
> + */
> + syscall = syscall_get_nr(current, regs);
>
> add_random_kstack_offset();
>
> - if (syscall >= 0 && syscall < NR_syscalls)
> + if (syscall < 0 || syscall >= NR_syscalls)
> + regs->a0 = -ENOSYS;
> + else if (res != -1)
> syscall_handler(regs, syscall);
Here we can perform the syscall, even if res is -1. E.g., if this path
[2] is taken, we might have a valid syscall number in a7, but the
syscall should not be performed.
Also, one reason for the generic entry is so that it should be less
work. Here, you pull (IMO) details that belong to the common entry
implementation/API up the common entry user. Wdyt about pushing it down
to common entry? Something like:
--8<--
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 51ebfd23e007..66fded8e4b60 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -319,7 +319,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
regs->epc += 4;
regs->orig_a0 = regs->a0;
- regs->a0 = -ENOSYS;
riscv_v_vstate_discard(regs);
@@ -329,6 +328,8 @@ void do_trap_ecall_u(struct pt_regs *regs)
if (syscall >= 0 && syscall < NR_syscalls)
syscall_handler(regs, syscall);
+ else if (syscall != -1)
+ regs->a0 = -ENOSYS;
/*
* Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1e50cdb83ae5..9b69c2ad4f12 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -14,6 +14,7 @@
#include <linux/kmsan.h>
#include <asm/entry-common.h>
+#include <asm/syscall.h>
/*
* Define dummy _TIF work flags if not defined by the architecture or for
@@ -166,6 +167,8 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
if (work & SYSCALL_WORK_ENTER)
syscall = syscall_trace_enter(regs, syscall, work);
+ else if (syscall == -1L)
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
return syscall;
}
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5b6934e23c21..99742aee5002 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -43,8 +43,10 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
- if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
+ if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) {
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
return -1L;
+ }
}
/* Do seccomp after ptrace, to catch any tracer changes. */
@@ -66,6 +68,14 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
syscall = syscall_get_nr(current, regs);
}
+ /*
+ * If we're not setting the return value here, the context is
+ * gone; For higher callers, -1 means that the syscall should
+ * be skipped.
+ */
+ if (syscall == -1L)
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
+
syscall_enter_audit(regs, syscall);
return ret ? : syscall;
--8<--
I did a quick test of the above, and it seems to cover all the previous
bugs -- but who knows! ;-)
Björn
[1] https://lore.kernel.org/linux-riscv/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/entry/common.c#n47
More information about the linux-riscv
mailing list