nvme-tcp: i/o errors and stalled host from failure to send command pdus

Jonathan Nicklin jnicklin at blockbridge.com
Wed Aug 31 12:06:04 PDT 2022


Something about the logged state didn't make sense to me. Sure enough,
I had a bug in the state tracking. Here's another attempt.

I also added "smp_id2", which is raw_smp_processor_id() recorded
immediately after the direct-send case. Is it interesting that the
CPU appears to change for the first command that goes missing?

Here's the arrangement of smp_id and smp_id2.

    smp_id = raw_smp_processor_id();
    if (queue->io_cpu == smp_id && sync && empty) {
        lock = mutex_trylock(&queue->send_mutex);
        if (lock) {
            queue->more_requests = !last;
            nvme_tcp_send_all(queue);
            queue->more_requests = false;
            mutex_unlock(&queue->send_mutex);
        }
    }
    smp_id2 = raw_smp_processor_id();
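
For completeness, here's roughly how those values get reported. The struct
and helper below are only a sketch of the debug instrumentation (the names
are made up; this is not the actual patch): the state captured around the
direct-send path is stashed on the request and then printed from the
timeout handler, which is what produces the dmesg "state:" lines below.

    /*
     * Sketch only: hypothetical per-request debug state recorded in
     * nvme_tcp_queue_request() and dumped when the request times out.
     */
    struct nvme_tcp_queue_rq_dbg {
        int     io_cpu;
        int     smp_id;
        int     smp_id2;
        bool    sync;
        bool    empty;
        bool    lock;
        bool    last;
        bool    more;
    };

    static void nvme_tcp_dump_rq_dbg(struct nvme_ctrl *ctrl, u16 tag,
                                     const struct nvme_tcp_queue_rq_dbg *d)
    {
        /* produces the "state: tag ..." lines shown in dmesg below */
        dev_warn(ctrl->device,
                 "state: tag 0x%x io_cpu %d smp_id %d smp_id2 %d sync %d empty %d lock %d last %d more %d\n",
                 tag, d->io_cpu, d->smp_id, d->smp_id2, d->sync,
                 d->empty, d->lock, d->last, d->more);
    }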

Here's the new trace with dmesg outputs:

tar-1825    [005] .....   430.572846: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=2f op=0 data_len=32768
tar-1825    [005] .N...   430.572849: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=45103, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917888, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

dmesg.out [  459.626849] nvme nvme1:
 state: tag 0x2f io_cpu 5 smp_id 5 smp_id2 2 sync 1 empty 1 lock 1 last 0 more 0
                 ^^^^^^^^ ^^^^^^^^ ^^^^^^^^^

kworker/5:2H-1827    [005] .....   430.572913: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=2e op=0 data_len=32768
kworker/5:2H-1827    [005] .....   430.572913: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=16430, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917824, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

dmesg.out [  459.593681] nvme nvme1:
 state: tag 0x2e io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1


kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=2d op=0 data_len=32768
kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=28717, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917760, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

dmesg.out [  459.560506] nvme nvme1:
 state: tag 0x2d io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1


kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=29 op=0 data_len=32768
kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=24617, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917696, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out [  459.540904] nvme nvme1:
 state: tag 0x29 io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 1 more 1


On Wed, Aug 31, 2022 at 2:50 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
>
> On 8/31/22 19:04, Jonathan Nicklin wrote:
> > On Wed, Aug 31, 2022 at 5:02 AM Sagi Grimberg <sagi at grimberg.me> wrote:
> >>
> >>
> >>> Hello,
> >>
> >> Hey Jonathan,
> >>
> >>> We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> >>> with kernel versions 5.15 through 5.19. Using a combination of tracing
> >>> events and network traces, we can see nvme_setup_cmd() trace events
> >>> that do not result in transmitted command PDUs.
> >>>
> >>> Here's a snippet of the relevant portion of the nvme trace. The
> >>> following events are the last of tens of thousands of successfully
> >>> executed reads...
> >>>
> >>> tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>>
> >>> The following 5 commands are issued to nvme, but never make it to the
> >>> wire. The wire state was verified manually using tcpdump on the host and
> >>> trace events on the target. Perhaps the "need-resched" marker in the
> >>> trace and the context switch to the kernel worker are relevant
> >>> to the failure?
> >>>
> >>> tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>>
> >>> The reads above eventually time out and recover after reset. Here are
> >>> the related reset messages.
> >>>
> >>> [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 1 prio class 0
> >>> [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> >>> [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 1 prio class 0
> >>> [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> >>> [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 2 prio class 0
> >>> [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> >>> [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> >>> flags 0x4080700 phys_seg 2 prio class 0
> >>> [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 2 prio class 0
> >>>
> >>> To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> >>> patch from a ways back. If my understanding is correct,
> >>> nvme_tcp_cmd_pdu() is being called for the commands in question.
> >>>
> >>> I've isolated the failure to a scenario we can reproduce at will
> >>> in about a minute of runtime. Is there any additional debug that we can
> >>> provide to help locate the bug?
> >>
> >> Would be interesting to understand if for these requests,
> >> nvme_tcp_queue_request scheduled io_work or not (can be marked on the
> >> command and printed during the timeout expiration). If not, would be
> >> interesting to understand why.
> >
> > Okay, here's a repro with a bit more context. I recorded the state
> > from nvme_tcp_queue_request() in the request and printed it out as you
> > suggested. I also added trace events for nvme_tcp_queue_rq() to make
> > it easier to match the state.
> >
> > - lock is the return value of trylock() (if executed)
> > - more is the return value of nvme_tcp_queue_more() (if executed)
> > - this_cpu is raw_smp_processor_id();
> >
> > Here's the trace output for the failed requests with the associated
> > completion messages (from dmesg) stitched in.
> >
> >    tar-2153 [001] .....   536.383588: nvme_tcp_queue_rq:
> >      nvme1: qid=2 tag=5 op=0 data_len=32768
> >    tar-2153 [001] .....   536.383588: nvme_setup_cmd:
> >      nvme1: disk=nvme1c1n1, qid=2, cmdid=8197, nsid=1, flags=0x0, meta=0x0,
> >        cmd=(nvme_cmd_read slba=9950336, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >    dmesg.out           [  565.574700] nvme_tcp_queue_request()
> >      state: tag 0x5 io_cpu 1 this_cpu 1 lock 1 sync 1 empty 1 last 0 more 1
>
> This is strange...
> cpu matches, and the lock is taken, this means that the request should
> have been sent immediately to the wire...
>
> However, last=0 will cause the send to be marked with MSG_SENDPAGE_NOTLAST,
> which means the network stack will not send it right away...
>
> >
> >   tar-2153  [001] .N...   536.383594: nvme_tcp_queue_rq:
> >     nvme1: qid=2 tag=4 op=0 data_len=32768
> >   tar-2153  [001] .N...   536.383594: nvme_setup_cmd:
> >     nvme1: disk=nvme1c1n1, qid=2, cmdid=16388, nsid=1, flags=0x0, meta=0x0,
> >       cmd=(nvme_cmd_read slba=9950272, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >   dmesg.out            [  565.540965] nvme_tcp_queue_request()
> >     state: tag 0x4 io_cpu 1 this_cpu 5 lock 1 sync 1 empty 1 last 0 more 1
>
> Same here... sent down the network stack, but with MSG_SENDPAGE_NOTLAST.
>
> > kworker/1:0H-23  [001] .....   536.383619: nvme_tcp_queue_rq:
> >    nvme1: qid=2 tag=3 op=0 data_len=32768
> > kworker/1:0H-23  [001] .....   536.383619: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=2, cmdid=24579, nsid=1, flags=0x0, meta=0x0,
> >      cmd=(nvme_cmd_read slba=9950208, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > dmesg.out              [  565.490597] nvme_tcp_queue_request()
> >    state: tag 0x3 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 0 more 1
>
> Here, the queue is not empty, so we are not sending right away; it's
> also not the last in the batch, so we are not queueing io_work either.
> We will wait for the last request in the batch to schedule io_work to
> send the batch to the network.
>
> > kworker/1:0H-23  [001] .....   536.383620: nvme_tcp_queue_rq:
> >    nvme1: qid=2 tag=2 op=0 data_len=32768
> > kworker/1:0H-23  [001] .....   536.383620: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=2, cmdid=40962, nsid=1, flags=0x0, meta=0x0,
> >      cmd=(nvme_cmd_read slba=9950144, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > dmesg.out                   [  565.470687] nvme_tcp_queue_request()
> >    state: tag 0x2 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 1 more 1
>
> Here again, the queue is not empty, and it's the last request in the
> batch, so we should have scheduled io_work to send along the queued
> requests in the send_list.
>
> These are not necessarily requests that were issued one after the other;
> however, the last request should have flushed all the prior ones...
>
> Can you add an indication that a command was actually sent to the wire
> (i.e., that nvme_tcp_try_send_cmd_pdu() sent the entire cmd pdu)?
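
For reference, the kind of "actually sent" indication being asked for could
look something like the sketch below (dbg_cmd_pdu_sent is a hypothetical
per-request debug field; this is not the actual patch): mark the request in
nvme_tcp_try_send_cmd_pdu() once the remaining header length drops to zero,
meaning the entire command PDU was handed to the socket, and print the flag
from the timeout handler next to the rest of the recorded state.

    /* Sketch only: dbg_cmd_pdu_sent is a made-up debug field. */
    ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
                          offset_in_page(pdu) + req->offset, len, flags);
    if (ret > 0) {
        len -= ret;
        if (!len)
            req->dbg_cmd_pdu_sent = true;   /* full cmd pdu pushed to socket */
    }

    /* ...and from the timeout handler: */
    dev_warn(ctrl->device, "tag 0x%x cmd_pdu_sent %d\n",
             rq->tag, req->dbg_cmd_pdu_sent);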
