[PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support

Peter Collingbourne pcc at google.com
Thu Dec 16 16:09:37 PST 2021


On Thu, Dec 16, 2021 at 5:05 AM Thomas Gleixner <tglx at linutronix.de> wrote:
>
> Peter,
>
> On Wed, Dec 15 2021 at 17:25, Peter Collingbourne wrote:
> > On Sat, Dec 11, 2021 at 3:50 AM Thomas Gleixner <tglx at linutronix.de> wrote:
> >> This restores the signal blocked mask _after_ exit_to_user_mode_loop()
> >> has completed, recalculates pending signals and goes out to user space
> >> with eventually pending signals.
> >>
> >> How is this supposed to be even remotely correct?
> >
> > Please see this paragraph from the documentation:
> >
> > When entering the kernel with a non-zero uaccess descriptor
> > address for a reason other than a syscall (for example, when
> > IPI'd due to an incoming asynchronous signal), any signals other
> > than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> > ``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> > initialized with ``sigfillset(set)``. This is to prevent incoming
> > signals from interfering with uaccess logging.
> >
> > I believe that we will also go out to userspace with pending signals
> > when one of the signals that came in was a masked (via sigprocmask)
> > asynchronous signal, so this is an expected state.
>
> Believe is not part of a technical analysis, believe belongs into the
> realm of religion.
>
> It's a fundamental difference whether the program masks signals itself
> or the kernel decides to do that just because.
>
> Pending signals, which are not masked by the process, have to be
> delivered _before_ returning to user space.
>
>     That's the expected behaviour. Period.
>
> Instrumentation which changes the semantics of the observed code is
> broken by definition.

The idea is that the uaccess descriptor address would be set to a
non-zero value inside the syscall wrapper, before performing the
syscall. Since the kernel will set the uaccess descriptor address to
zero before returning from a syscall, at no point should the caller of
the syscall wrapper be executing with a non-zero uaccess descriptor
address. At worst, a signal will be delivered to a task executing a
syscall wrapper a few instructions later than it would otherwise, but
that's not really important because the determination of the exact
delivery point of an asynchronous signal is fundamentally racy anyway.

Basically, it's as if the syscall wrapper did this:

// During task startup:
uint64_t addr;
prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);

// Wrapper for syscall "x"
int x(...) {
  sigset_t set, old_set;
  struct uaccess_descriptor desc = { ... };

  addr = &desc;
  // The following two statements implicitly occur atomically together
with setting addr:
  sigfillset(set);
  sigprocmask(SIG_SETMASK, set, old_set);

  syscall(__NR_x ...,);
  // The following two statements implicitly occur atomically together
with the syscall:
  sigprocmask(SIG_SETMASK, old_set, NULL);
  addr = 0;

  // Now the uaccesses for syscall "x" are logged to "desc".
}

Aside from the guarantees of atomicity, this really seems no different
from the kernel providing another API for setting the signal mask.

> >> So what is this pre/post exit loop code about? Handle something which
> >> cannot happen in the first place?
> >
> > The pre/post_exit_loop() functions are checking UACCESS_BUFFER_ENTRY,
> > which is set when prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR) has been
> > used to set the uaccess descriptor address address to a non-zero
> > value. It is a different flag from UACCESS_BUFFER_EXIT. It is
> > certainly possible for the ENTRY flag to be set in your 2) above,
> > since that flag is not normally modified while inside the kernel.
>
> Let me try again. The logger is only active when:
>
>     1) PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR has set an address, which
>        sets UACCESS_BUFFER_ENTRY
>
>     2) The task enters the kernel via syscall and the syscall entry
>        observes UACCESS_BUFFER_ENTRY and sets UACCESS_BUFFER_EXIT
>
> because the log functions only log when UACCESS_BUFFER_EXIT is set.
>
> UACCESS_BUFFER_EXIT is cleared in the syscall exit path _before_ the
> exit to usermode loop is reached, which means signal delivery is _NOT_
> logged at all.

Right. It is not the intent to log uaccesses associated with signal
delivery. Only uaccesses that occur while handling the syscall itself
are logged.

> A non-syscall entry from user space - interrupt, exception, NMI - will
> _NOT_ set UACCESS_BUFFER_EXIT because it takes a different entry
> path. So when that non-syscall entry returns and delivers a signal then
> there is no logging.

Again, that's fine, there's no intent to log that.

> When the task has entered the kernel via a syscall and the kernel gets
> interrupted and that interruption raises a signal, then there is no
> signal delivery. The interrupt returns to kernel mode, which obviously
> does not go through exit_to_user_mode(). The raised signal is delivered
> when the task returns from the syscall to user mode, but that won't be
> logged because UACCESS_BUFFER_EXIT is already cleared before the exit to
> user mode loop is reached.
>
> See?

Perhaps there is a misunderstanding of the purpose of the signal
blocking with non-zero uaccess descriptor address. It isn't there
because we want to log anything about these signals. It's there
because we don't want a signal handler to be invoked between when we
arrange for the kernel to log the next syscall and when we issue the
syscall that we want to log, because that could lead to the signal
handler's syscalls being logged instead of the syscall that we intend
to log.

Consider the syscall wrapper for syscall "x" above, and imagine that
we didn't have the sigprocmask statements, and imagine that a signal
came in after storing &desc to addr but before the call to syscall.
Also imagine that the handler for that signal is unaware of uaccess
logging, so it just issues syscalls directly without touching addr.
Now the first syscall performed by the signal handler will be logged,
instead of the intended syscall "x", because the kernel will read the
uaccess descriptor intended for logging syscall "x" from addr when
entering the kernel for the signal handler's first syscall.

The kernel setting addr to 0 during the syscall is also necessary in
order for the kernel to continue processing signals normally once the
logged syscall has returned. Effectively any incoming signals are
queued until we have finished processing the logged syscall. Because
the kernel has set addr to 0, it refrains from blocking signals when
returning to userspace from the logged syscall, and therefore any
pending signals are delivered. Any syscalls that occur in any signal
handlers invoked via this signal delivery will not interfere with the
previously collected log for syscall "x", precisely because addr was
set to 0 by the kernel. If we left it up to userspace to set addr back
to 0, we would have the same problem as if we didn't have the
sigprocmask statements, but now the critical section extends until the
userspace program sets addr to 0. Furthermore, a userspace program
setting addr to 0 would not automatically cause the pending signals to
be delivered (because simply storing a value to memory from userspace
will not necessarily trigger a kernel entry), and signals could
therefore be left blocked for longer than expected (at least until the
next kernel entry).

> >> then we have a very differrent understanding of what documentation
> >> should provide.
> >
> > This was intended as interface documentation, so it doesn't go into
> > too many details. It could certainly be improved though by referencing
> > the user documentation, as I mentioned above.
>
> Explanations which are required to make the code understandable have to
> be in the code/kernel-doc comments and not in some disjunct place. This
> disjunct documentation is guaranteed to be out of date within no time.

Got it. From our discussion it's clear that the justification for the
design of the uaccess logging interface (especially the signal
handling parts) needs to be documented in the code in order to avoid
confusion.

Peter



More information about the linux-arm-kernel mailing list