[PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

Mon Nov 22 21:17:54 PST 2021

On Sat, Sep 25, 2021 at 7:20 PM Kees Cook <keescook at chromium.org> wrote:
>
> On Fri, Sep 24, 2021 at 02:50:04PM -0700, Peter Collingbourne wrote:
> > On Wed, Sep 22, 2021 at 8:59 AM Jann Horn <jannh at google.com> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 5:30 PM Kees Cook <keescook at chromium.org> wrote:
> > > > On Wed, Sep 22, 2021 at 09:23:10AM -0500, Eric W. Biederman wrote:
> > > > > Peter Collingbourne <pcc at google.com> writes:
> > > > > > This patch introduces a kernel feature known as uaccess logging.
> > > > > > [...]
> > > > > [...]
> > > > > How is logging the kernel's activity like this not a significant
> > > > > information leak?  How is this safe for unprivileged users?
> > > > [...]
> > > > Regardless, this is a pretty useful tool for this kind of fuzzing.
> > > > Perhaps the timing exposure could be mitigated by having the kernel
> > > > collect the record in a separate kernel-allocated buffer and flush the
> > > > results to userspace at syscall exit? (This would solve the
> > > > copy_to_user() recursion issue too.)
> >
> > Seems reasonable. I suppose that in terms of timing information we're
> > already (unavoidably) exposing how long the syscall took overall, and
> > we probably shouldn't deliberately expose more than that.
>
> Right -- I can't think of anything that can really use this today,
> but it very much feels like the kind of information that could aid in
> a timing race.

Okay, in v2 the log now goes via a kernel-allocated buffer.

> > That being said, I'm wondering if that has security implications on
> > its own if it's then possible for userspace to manipulate the kernel
> > into allocating a large buffer (either at prctl() time or as a result
> > of getting the kernel to do a large number of uaccesses). Perhaps it
> > can be mitigated by limiting the size of the uaccess buffer provided
> > at prctl() time.
>
> There are a lot of exact-size allocation controls already (which I think
> is an unavoidable but separate issue[1]), but perhaps this could be
> mitigated by making the reserved buffer be PAGE_SIZE granular?

I was thinking more about userspace causing a kernel OOM or something
by making the kernel allocate large buffers. I decided to mitigate that
by putting an upper limit on the size of the kernel-side buffer.

Since it sounds like exact-size allocations are a pre-existing issue,
we probably don't need to do anything about them at this time.

> > > One aspect that might benefit from some clarification on intended
> > > behavior is: what should happen if there are BPF tracing programs
> > > running (possibly as part of some kind of system-wide profiling or
> > > such) that poke around in userspace memory with BPF's uaccess helpers
> > > (especially "bpf_copy_from_user")?
> >
> > I think we should probably be ignoring those accesses, since we cannot
> > know a priori whether the accesses are directly associated with the
> > syscall or not, and this is after all a best-effort mechanism.
>
> Perhaps the "don't log this uaccess" flag I suggested could be
> repurposed by BPF too, as a general "make this access invisible to
> PR_LOG_UACCESS" flag? i.e. this bit:

Since the kernel-side buffer made this flag unnecessary, I just made
BPF use raw_copy_from_user().

> > > > Instead of reimplementing copy_*_user() with a new wrapper that
> > > > bypasses some checks and adds others and has to stay in sync, etc,
> > > > how about just adding a "recursion" flag? Something like:
> > > >
> > > >     copy_from_user(...)
> > > >         instrument_copy_from_user(...)
> > > >             uaccess_buffer_log_read(...)
> > > >                 if (current->uaccess_buffer.writing)
> > > >                     return;
> > > >                 uaccess_buffer_log(...)
> > > >                     current->uaccess_buffer.writing = true;
> > > >                     copy_to_user(...)
> > > >                     current->uaccess_buffer.writing = false;
>
> > > > This would likely only make sense for SECCOMP_RET_TRACE or _TRAP if the
> > > > program wants to collect the results after every syscall. And maybe this
> > > > won't make any sense across exec (losing the mm that was used during
> > > > SECCOMP_SET_UACCESS_TRACE_BUFFER). Hmmm.
> > >
> > > And then I guess your plan would be that userspace would be expected
> > > to use the userspace instruction pointer
> > > (seccomp_data::instruction_pointer) to indicate instructions that
> > > should be traced?
>
> That could be one way -- but seccomp filters would allow a bunch of
> ways.
>
> > >
> > > Or instead of seccomp, you could do it kinda like
> > > https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html
> > > , with a prctl that specifies a specific instruction pointer?
> >
> > Given a choice between these two options, I would prefer the prctl()
> > because userspace programs may already be using seccomp filters and
> > sanitizers shouldn't interfere with it.
>
> That's fair -- the "I wish we could make complex decisions about which
> syscalls to act on" sounds like seccomp.
>
> > However, in either the seccomp filter or prctl() case, you still have
> > the problem of deciding where to log to. Keep in mind that you would
> > need to prevent intervening async signals (that occur between when the
> > syscall happens and when we read the log) from triggering additional
>
> Could the sig handler also set the "make the uaccess invisible" flag?
> (It would need to be a "depth" flag, most likely.)

It's more complicated than that because you can longjmp() out of a
signal handler, and that won't necessarily call sigreturn(). The kernel
doesn't really have a concept of "depth" as applied to signal
handlers; it's all managed on the userspace stack.
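
To illustrate the longjmp() problem, here is a minimal userspace sketch
(the names escape_point, handler_depth and demo_longjmp_out_of_handler
are made up for this example). It keeps a naive depth counter around a
handler and shows that the "leave" half never runs when the handler
exits via siglongjmp(), so any depth tracking tied to the handler
returning would go stale.

.. code-block:: c

  #define _POSIX_C_SOURCE 200809L
  #include <setjmp.h>
  #include <signal.h>

  static sigjmp_buf escape_point;              /* hypothetical name */
  static volatile sig_atomic_t handler_depth;  /* naive "depth" counter */

  static void on_sigusr1(int signo)
  {
    (void)signo;
    handler_depth++;              /* "enter" bookkeeping */
    siglongjmp(escape_point, 1);  /* leaves without returning */
    handler_depth--;              /* "leave" bookkeeping, never reached */
  }

  /* Returns the depth counter observed after escaping the handler. */
  int demo_longjmp_out_of_handler(void)
  {
    struct sigaction sa = { .sa_handler = on_sigusr1 };

    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
      return -1;

    handler_depth = 0;
    /* savemask=1 so the signal mask is restored by siglongjmp(). */
    if (sigsetjmp(escape_point, 1) == 0) {
      raise(SIGUSR1);
      return -1;  /* not reached: the handler jumps away */
    }
    return handler_depth;  /* still 1: the decrement was skipped */
  }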

I brainstormed this with Dmitry a bit out of band and we came up with
a nice solution that avoids the two syscalls, is arch-generic and
avoids the problem with asynchronous signal handlers. I'll paste a bit
from the documentation that I wrote, but please see the full
documentation in v2 patch 5/5 for more details.

The feature may be used via the following prctl:

.. code-block:: c

  uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);

Supplying a non-zero address as the second argument to ``prctl``
will cause the kernel to read an address (referred to as the *uaccess
descriptor address*) from that address on each kernel entry.

When entering the kernel to handle a syscall with a non-zero uaccess
descriptor address, the kernel will read a data structure of type
``struct uaccess_descriptor`` from the uaccess descriptor address,
which is defined as follows:

.. code-block:: c

  struct uaccess_descriptor {
    uint64_t addr, size;
  };

This data structure contains the address and size (in array elements)
of a *uaccess buffer*, which is an array of data structures of type
``struct uaccess_buffer_entry``. Before returning to userspace, the
kernel will log information about uaccesses to sequential entries
in the uaccess buffer. It will also store ``NULL`` to the uaccess
descriptor address, and store the address and size of the unused
portion of the uaccess buffer to the uaccess descriptor.
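
As a rough illustration of the bookkeeping described above, the
following is a userspace model only, not the kernel code. The entry
layout (``struct demo_entry``), the function name
``model_syscall_exit`` and the truncate-when-full behaviour are
assumptions made for the sketch; the real ``struct
uaccess_buffer_entry`` is defined in the full documentation.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  /* As defined in the documentation above. */
  struct uaccess_descriptor {
    uint64_t addr, size;
  };

  /* HYPOTHETICAL stand-in for struct uaccess_buffer_entry. */
  struct demo_entry {
    uint64_t addr, size;
  };

  /*
   * Model of what happens before returning to userspace: copy the
   * pending records into the uaccess buffer, point the descriptor at
   * the unused tail, and clear the descriptor address slot.
   * Returns the number of entries written.
   */
  size_t model_syscall_exit(uint64_t *desc_addr_slot,
                            const struct demo_entry *pending,
                            size_t n_pending)
  {
    if (*desc_addr_slot == 0)
      return 0;  /* logging was not requested for this syscall */

    struct uaccess_descriptor *desc =
        (struct uaccess_descriptor *)(uintptr_t)*desc_addr_slot;
    struct demo_entry *buf = (struct demo_entry *)(uintptr_t)desc->addr;

    /* Assumption: records that do not fit are silently dropped. */
    size_t n = n_pending < desc->size ? n_pending : (size_t)desc->size;

    for (size_t i = 0; i < n; i++)
      buf[i] = pending[i];

    desc->addr += n * sizeof(struct demo_entry);  /* unused portion */
    desc->size -= n;
    *desc_addr_slot = 0;  /* the NULL store described above */
    return n;
  }

Because the slot is cleared on exit, a later call in this model logs
nothing until userspace stores the descriptor address again.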

[...]

When entering the kernel for a reason other than a syscall (for
example, when IPI'd due to an incoming asynchronous signal) with
a non-zero uaccess descriptor address, any signals other
than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
initialized with ``sigfillset(set)``. This is to prevent incoming
signals from interfering with uaccess logging.
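
For reference, the userspace equivalent of that masking looks roughly
like the sketch below (the function name ``demo_mask_all_signals`` is
made up). It relies on Linux behaviour, where ``SIGKILL`` and
``SIGSTOP`` are silently dropped from the requested mask, so treat the
check on those two signals as Linux-specific.

.. code-block:: c

  #define _POSIX_C_SOURCE 200809L
  #include <signal.h>
  #include <stddef.h>

  /*
   * Block every signal, read the resulting mask back, then restore the
   * old mask. Returns 1 if SIGUSR1 ended up blocked while SIGKILL and
   * SIGSTOP did not, 0 otherwise, -1 on error.
   */
  int demo_mask_all_signals(void)
  {
    sigset_t set, old, now;

    sigfillset(&set);
    if (sigprocmask(SIG_SETMASK, &set, &old) != 0)
      return -1;

    /* A NULL new set just queries the current mask. */
    if (sigprocmask(SIG_SETMASK, NULL, &now) != 0)
      return -1;

    int ok = sigismember(&now, SIGUSR1) == 1 &&
             sigismember(&now, SIGKILL) == 0 &&
             sigismember(&now, SIGSTOP) == 0;

    /* Put the original mask back so the caller is unaffected. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
      return -1;
    return ok;
  }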

Peter