nvme-tcp: i/o errors and stalled host from failure to send command pdus
Sagi Grimberg
sagi at grimberg.me
Wed Aug 31 02:02:52 PDT 2022
> Hello,
Hey Jonathan,
> We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> with kernel versions 5.15 through 5.19. Using a combination of tracing
> events and network traces, we can see nvme_setup_cmd() trace events
> that do not result in transmitted command PDUs.
>
> Here's a snippet of the relevant portion of the nvme trace. The
> following events are the last of tens of thousands of successfully
> executed reads...
>
> tar-9004 [004] ..... 22457.614117: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004 [004] ..... 22457.614118: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004 [004] ..... 22457.614119: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004 [004] ..... 22457.614119: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004 [004] ..... 22457.614119: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> The following 5 commands are issued to nvme but never make it to the
> wire. The wire state was verified manually using tcpdump on the host and
> trace events on the target. Perhaps the "need-resched" marker (the 'n'
> in the trace flags) and the subsequent context switch to the kernel
> worker are relevant to the failure?
>
> tar-9004 [004] .n... 22457.614171: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668 [004] ..... 22457.614183: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668 [004] ..... 22457.614184: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668 [004] ..... 22457.614184: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668 [004] ..... 22457.614184: nvme_setup_cmd:
> nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
> cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> The reads above eventually time out and recover after a reset. Here are
> the related timeout and I/O error messages.
>
> [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> flags 0x4084700 phys_seg 1 prio class 0
> [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> flags 0x4084700 phys_seg 1 prio class 0
> [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> flags 0x4084700 phys_seg 2 prio class 0
> [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> flags 0x4080700 phys_seg 2 prio class 0
> [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> flags 0x4084700 phys_seg 2 prio class 0
>
> To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> patch from a ways back. If my understanding is correct,
> nvme_tcp_cmd_pdu() is being called for the commands in question.
>
> I've isolated the failure to a scenario we can reproduce at will
> in about a minute of runtime. Is there any additional debug that we can
> provide to help locate the bug?
It would be interesting to understand whether, for these requests,
nvme_tcp_queue_request() scheduled io_work or not (this can be marked on
the command and printed when the timeout expires). If not, it would be
interesting to understand why.
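
Something along the lines of the following minimal sketch could record
that. It is written against the 5.19-era nvme_tcp_queue_request() in
drivers/nvme/host/tcp.c; the queue_path field, the enum, and the extra
dev_warn() are hypothetical debug additions, not upstream code:

/*
 * Hypothetical debug instrumentation: record which send path each
 * request took in nvme_tcp_queue_request(), so the timeout handler
 * can report it for stuck requests.
 */
enum nvme_tcp_queue_path {
	NVME_TCP_QP_NONE = 0,	/* left on req_list for a later io_work run */
	NVME_TCP_QP_INLINE,	/* sent directly via nvme_tcp_send_all() */
	NVME_TCP_QP_IO_WORK,	/* io_work explicitly queued */
};

struct nvme_tcp_request {
	/* ... existing fields ... */
	enum nvme_tcp_queue_path queue_path;	/* hypothetical debug field */
};

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	req->queue_path = NVME_TCP_QP_NONE;
	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * Try to send inline if we are on the io_cpu and the queue is
	 * otherwise empty; otherwise defer to io_work.
	 */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		req->queue_path = NVME_TCP_QP_INLINE;
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	} else if (last) {
		req->queue_path = NVME_TCP_QP_IO_WORK;
		queue_work_on(queue->io_cpu, nvme_wq, &queue->io_work);
	}
}

/* In nvme_tcp_timeout(), report the recorded path for the request: */
	dev_warn(ctrl->device,
		 "queue %d: timeout request %#x queue_path %d\n",
		 nvme_tcp_queue_id(req->queue), rq->tag, req->queue_path);

Marking per-request rather than keeping a per-queue counter lets the
timeout print distinguish requests that were left on req_list (to be
picked up by a later send) from ones that were explicitly handed to
io_work.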