nvme-tcp: i/o errors and stalled host from failure to send command pdus

Jonathan Nicklin jnicklin at blockbridge.com
Wed Aug 31 09:04:42 PDT 2022


On Wed, Aug 31, 2022 at 5:02 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> > Hello,
>
> Hey Jonathan,
>
> > We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> > with kernel versions 5.15 through 5.19. Using a combination of tracing
> > events and network traces, we can see nvme_setup_cmd() trace events
> > that do not result in transmitted command PDUs.
> >
> > Here's a snippet of the relevant portion of the nvme trace. The
> > following events are the last of tens of thousands of successfully
> > executed reads...
> >
> > tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >
> > The following 5 commands are issued to nvme, but never make it to the
> > wire. The wire state was verified manually using tcpdump on the host and
> > trace events on the target. Perhaps the "need-resched" marker in the
> > trace and the context switch to the kernel worker are relevant
> > to the failure?
> >
> > tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >
> > The reads above eventually time out and recover after a controller
> > reset. Here are the related timeout and I/O error messages.
> >
> > [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> > flags 0x4084700 phys_seg 1 prio class 0
> > [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> > [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> > flags 0x4084700 phys_seg 1 prio class 0
> > [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> > [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> > flags 0x4084700 phys_seg 2 prio class 0
> > [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> > [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> > flags 0x4080700 phys_seg 2 prio class 0
> > [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> > flags 0x4084700 phys_seg 2 prio class 0
> >
> > To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> > patch from a ways back. If my understanding is correct,
> > nvme_tcp_cmd_pdu() is being called for the commands in question.
> >
> > I've isolated the failure to a scenario we can reproduce at will
> > in about a minute of runtime. Is there any additional debug that we can
> > provide to help locate the bug?
>
> Would be interesting to understand if for these requests,
> nvme_tcp_queue_request scheduled io_work or not (can be marked on the
> command and printed during the timeout expiration). If not, would be
> interesting to understand why.

Okay, here's a repro with a bit more context. I recorded the state
from nvme_tcp_queue_request() in the request and printed it out as you
suggested. I also added trace events for nvme_tcp_queue_rq() to make
it easier to match the state.

- lock is the return value of trylock() (if executed)
- more is the return value of nvme_tcp_queue_more() (if executed)
- this_cpu is raw_smp_processor_id()

Here's the trace output for the failed requests with the associated
completion messages (from dmesg) stitched in.

tar-2153 [001] .....   536.383588: nvme_tcp_queue_rq:
  nvme1: qid=2 tag=5 op=0 data_len=32768
tar-2153 [001] .....   536.383588: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=2, cmdid=8197, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9950336, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out [  565.574700] nvme_tcp_queue_request()
  state: tag 0x5 io_cpu 1 this_cpu 1 lock 1 sync 1 empty 1 last 0 more 1

tar-2153 [001] .N...   536.383594: nvme_tcp_queue_rq:
  nvme1: qid=2 tag=4 op=0 data_len=32768
tar-2153 [001] .N...   536.383594: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=2, cmdid=16388, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9950272, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out [  565.540965] nvme_tcp_queue_request()
  state: tag 0x4 io_cpu 1 this_cpu 5 lock 1 sync 1 empty 1 last 0 more 1

kworker/1:0H-23 [001] .....   536.383619: nvme_tcp_queue_rq:
  nvme1: qid=2 tag=3 op=0 data_len=32768
kworker/1:0H-23 [001] .....   536.383619: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=2, cmdid=24579, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9950208, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out [  565.490597] nvme_tcp_queue_request()
  state: tag 0x3 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 0 more 1

kworker/1:0H-23 [001] .....   536.383620: nvme_tcp_queue_rq:
  nvme1: qid=2 tag=2 op=0 data_len=32768
kworker/1:0H-23 [001] .....   536.383620: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=2, cmdid=40962, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9950144, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out [  565.470687] nvme_tcp_queue_request()
  state: tag 0x2 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 1 more 1

Does this shed any light on the issue?


