Race between SIGIO and epoll from SMP host

Wed Apr 21 12:53:19 BST 2021

Hi

I was playing around with xxd large files when I noticed that the
terminal output often stalls for a few seconds. It's a heisenbug so
the root cause was mostly found by debugging-by-printing and
trial-and-error.

This bug I can reliably reproduce by running `xxd /dev/zero` inside
UML that has its tty as `fd:0,fd:1`. Behavior: the terminal output
stalls in the middle of a line, stalling for seconds and possibly a
very long time. Keyboard input de-stalls the output but will stall
again quickly.

AFAICT, the way UML receives CONSOLE_WRITE_IRQ from stdout is that, it
sets O_ASYNC on stdout, then whenever a SIGIO is received by UML, it
checks for the origin of the SIGIO via epoll_wait in
os_waiting_for_events_epoll. However, this is racy: A SIGIO can be
generated prior to epoll receiving the event, so epoll_wait with zero
timeout may be called before epoll receives the event, and the event
can be missed.

As a minimal reproducible example, this C code running on the host
often assert failure in `res == 1`:

[snip]
#define _GNU_SOURCE 1

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <signal.h>
#include <sys/epoll.h>
#include <sys/fcntl.h>
#include <unistd.h>

#define MAX_EPOLL_EVENTS 64
volatile sig_atomic_t state;
static struct epoll_event epoll_events[MAX_EPOLL_EVENTS];

static char msg[] =
    "00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................\n";

static void sigio_handler(int ignored)
{
    assert(state == 1 || state == 0);
    state = 2;
}

int main(int argc, char **argv)
{
    int epollfd, flags, res;
    struct epoll_event event;

    state = 0;

    signal(SIGIO, sigio_handler);

    epollfd = epoll_create(MAX_EPOLL_EVENTS);
    assert(epollfd >= 0);

    event.data.ptr = NULL;
    event.events = EPOLLOUT | EPOLLET;
    assert(!epoll_ctl(epollfd, EPOLL_CTL_ADD, 1, &event));

    flags = fcntl(1, F_GETFL);
    assert(flags >= 0);
    flags |= O_ASYNC | O_NONBLOCK;

    assert(!fcntl(1, F_SETFL, flags));
    assert(!fcntl(1, F_SETSIG, SIGIO));
    assert(!fcntl(1, F_SETOWN, getpid()));

    for (;;) {
        if (state == 0) {
            res = write(1, msg, sizeof(msg));
            if (res < 0 && errno == EAGAIN)
                state = 1;
        } else if (state == 1) {
            pause();
        } else if (state == 2) {
            res = epoll_wait(epollfd, epoll_events, MAX_EPOLL_EVENTS, 1);
            assert(res == 1);
            state = 0;
        }
    }
}
[/snip]

strace -e epoll_wait shows, the last few lines:

  --- SIGIO {si_signo=SIGIO, si_code=POLL_OUT, si_band=772} ---
  epoll_wait(3, [{events=EPOLLOUT, data={u32=0, u64=0}}], 64, 1) = 1
  --- SIGIO {si_signo=SIGIO, si_code=POLL_OUT, si_band=772} ---
  epoll_wait(3, [{events=EPOLLOUT, data={u32=0, u64=0}}], 64, 1) = 1
  --- SIGIO {si_signo=SIGIO, si_code=POLL_OUT, si_band=772} ---
  epoll_wait(3, [], 64, 1)                = 0
  --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=2033614,
si_uid=1000} ---
  +++ killed by SIGABRT (core dumped) +++

I reproduced this assertion failure on:
* a self-built 5.11.14 SMP vanilla kernel running on bare metal, with
terminal as a graphical application.
* a 4.19.0-0.bpo.14-amd64 Debian 4.19.171-2~deb9u1 SMP kernel with
terminal as ssh terminal.

Can still reproduce after setting timeout to 1 ms, in both in UML and
in this minimal example.

I looked into how SIGIO and epoll events are generated for a tty, and
it seems that SIGIO is generated strictly before epoll events are
generated. gdb backtrace looks like (yes, on UML; didn't bother to
spin up qemu since I had a UML gdb open):

  #0  send_sigio (fown=0x674a4b98, fd=0x1, band=0x2) at ./fs/fcntl.c:795
  #1  kill_fasync_rcu (band=<optimized out>, sig=<optimized out>,
fa=0x672f90c0) at ./fs/fcntl.c:1019
  #2  kill_fasync (fp=<optimized out>, sig=0x1d, band=0x2) at ./fs/fcntl.c:1033
  #3  n_tty_write_wakeup (tty=<optimized out>) at ./drivers/tty/n_tty.c:243
  #4  tty_wakeup (tty=0x67485800) at ./drivers/tty/tty_io.c:533
  #5  tty_port_default_wakeup (port=<optimized out>) at
./drivers/tty/tty_port.c:50
  #6  tty_port_tty_wakeup (port=0x615ded80 <vts>) at
./drivers/tty/tty_port.c:388

  #0  ep_poll_callback (wait=0x674d2ad0, mode=0x1, sync=0x0, key=0x4)
at ./fs/eventpoll.c:361
  #1  __wake_up_common (wq_head=0x67485ce8, mode=0x1,
nr_exclusive=0x1, wake_flags=0x0, key=0x4, bookmark=0x61396f70) at
./kernel/sched/wait.c:108
  #2  __wake_up_common_lock (wq_head=0x67485ce8, mode=0x1,
nr_exclusive=0x1, wake_flags=0x0, key=0x4) at
./kernel/sched/wait.c:138
  #3  __wake_up (wq_head=0x67485ce8, mode=0x1, nr_exclusive=0x1,
key=0x4) at ./kernel/sched/wait.c:157
  #4  tty_wakeup (tty=0x67485800) at ./drivers/tty/tty_io.c:537
  #5  tty_port_default_wakeup (port=<optimized out>) at
./drivers/tty/tty_port.c:50
  #6  tty_port_tty_wakeup (port=0x615ded80 <vts>) at
./drivers/tty/tty_port.c:388

The last shared frame is in tty_wakeup, and n_tty_write_wakeup is
called by `ld->ops->write_wakeup` whereas __wake_up is in the macro
wake_up_interruptible_poll, so send_sigio is called strictly before
ep_poll_callback for tty_wakeup.

I have not tested on a non-SMP host, but on an SMP host, SIGIO could
cause the waiting task to wake up on a different core, and the task
could invoke epoll_wait before the interrupt handler generates the
epoll event in ep_poll_callback. If this happens to UML, the interrupt
for UML to make the tty ready will be missed until the next SIGIO.

Considering that this is a race on the host, what would be the best
way to fix this?

YiFei Zhu