nvme-tcp: i/o errors and stalled host from failure to send command pdus

Jonathan Nicklin jnicklin at blockbridge.com
Thu Sep 1 10:39:42 PDT 2022


Of course! The workload is a test from a suite of data consistency tests
we run. The test walks a directory and generates a checksum of the
contents, so it is pretty much read-only. The underlying filesystem
is XFS. The kernel is 5.19.5. To repro, I'm just running the following
in a loop:

umount /mnt/nvme
mount /dev/nvme1n1 /mnt/nvme
find /mnt/nvme/linux-5.18.3 | LC_ALL=C sort | \
    sudo tar -Pcf - -T - --no-recursion | md5sum

This is on a 4-core Intel CPU with hyperthreading (4+4) and a 100G
Mellanox NIC (all interrupts bound to core 0).


On Thu, Sep 1, 2022 at 1:00 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> >>> Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
> >>>
> >>> req->dbg.send = 1;
> >>> ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
> >>> req->dbg.sendpage = ret;
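
For orientation: the dmesg "state" lines below print per-request debug
fields from the same instrumentation patch (the patch itself isn't posted
in the thread). A hypothetical reconstruction of what the fields appear to
record, with names taken from the printed output and meanings inferred from
the discussion that follows:

    /*
     * Hypothetical layout of the debug state stashed in nvme_tcp_request;
     * field names come from the printed output, meanings are inferred.
     */
    struct nvme_tcp_req_dbg {
    	int  smp_id;     /* CPU sampled when the request was queued */
    	int  smp_id2;    /* CPU sampled again later in the submission path */
    	bool sync;       /* 'sync' argument to nvme_tcp_queue_request() */
    	bool empty;      /* req_list/send_list were empty at queue time */
    	bool lock;       /* send_mutex trylock succeeded (inline send path) */
    	bool last;       /* 'last' argument, i.e. bd->last from queue_rq */
    	bool more;       /* nvme_tcp_queue_more() result */
    	bool send;       /* nvme_tcp_try_send_cmd_pdu() reached kernel_sendpage() */
    	int  sendpage;   /* kernel_sendpage() return (printed as send_ret later;
    			    72 == a full command PDU) */
    	int  send_flags; /* flags passed to kernel_sendpage() (added in the
    			    second run below) */
    };

io_cpu in the same output is presumably queue->io_cpu, i.e. the CPU the
queue's io_work is bound to.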
> >>>
> >>>
> >>> nvme-trace.out
> >>>
> >>> tar-1785    [005] .....   232.768892: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=18 op=0 data_len=20480
> >>> tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=17 op=0 data_len=32768
> >>> tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=14 op=0 data_len=32768
> >>> kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=13 op=0 data_len=32768
> >>> kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=12 op=0 data_len=12288
> >>> tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
> >>>    nvme1: qid=8 tag=69 op=0 data_len=20480
> >>>
> >>>
> >>> dmesg.out (reordered to match nvme-trace.out)
> >>>
> >>> [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> >>
> >> Something is not adding up...
> >> lock=1 means that we get into nvme_tcp_send_all(), but somehow we are
> >> not able to send it to the wire?
> >>
> >> I'd look into this and add an indication that it is pulled from the
> >> req_list, and in turn from the send_list. It looks like it is somehow
> >> picked out and silently dropped, because as long as it is on the list(s)
> >> another context should have picked it up at some point...
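
For context, the queueing logic being described here is
nvme_tcp_queue_request(); a rough paraphrase of the 5.19 code (details may
differ slightly from the exact source), with the debug fields above mapped
onto it:

    static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
    		bool sync, bool last)
    {
    	struct nvme_tcp_queue *queue = req->queue;
    	bool empty;

    	/* "empty" in the debug output: nothing else queued or in flight */
    	empty = llist_add(&req->lentry, &queue->req_list) &&
    		list_empty(&queue->send_list) && !queue->request;

    	/*
    	 * "lock" in the debug output: we won send_mutex and sent inline
    	 * from the submitting context (same CPU as io_cpu, sync, empty)
    	 * instead of deferring everything to io_work.
    	 */
    	if (queue->io_cpu == raw_smp_processor_id() &&
    	    sync && empty && mutex_trylock(&queue->send_mutex)) {
    		queue->more_requests = !last;
    		nvme_tcp_send_all(queue);
    		queue->more_requests = false;
    		mutex_unlock(&queue->send_mutex);
    	}

    	/* on the last request of a batch, kick io_work if work remains */
    	if (last && nvme_tcp_queue_more(queue))
    		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
    }

So lock=1/empty=1 requests should have been drained inline by
nvme_tcp_send_all(), while lock=0 requests stay on the req_list until
io_work (or a later inline sender) picks them up.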
> >>
> >>> [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> >>> [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
> >>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> >>> [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 0 lock 0 last 0 more 1 send 1 sendpage 72
> >>> [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 0 lock 0 last 1 more 1 send 1 sendpage 72
> >
> > Ugh. It was misleading because the send state tracking was
> > re-initialized after the direct send case. Here's a better trace with
> > more information. It appears as though sendpage() is always called
> > with the NOTLAST flag set. I suppose this could mean that there is
> > some confusion or race with nvme_tcp_queue_more()?
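
For reference, the flag selection in question happens in
nvme_tcp_try_send_cmd_pdu(); roughly, paraphrasing the 5.19 driver (exact
code may differ slightly, and the tail of the function is elided):

    static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
    {
    	struct nvme_tcp_queue *queue = req->queue;
    	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
    	u8 hdgst = nvme_tcp_hdgst_len(queue);
    	int len = sizeof(*pdu) + hdgst - req->offset;
    	int flags = MSG_DONTWAIT;
    	int ret;

    	/*
    	 * NOTLAST/MORE whenever the command carries inline data or
    	 * nvme_tcp_queue_more() says more requests are pending; only
    	 * otherwise is the PDU pushed out with MSG_EOR.
    	 */
    	if (nvme_tcp_has_inline_data(req) || nvme_tcp_queue_more(queue))
    		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
    	else
    		flags |= MSG_EOR;

    	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
    			offset_in_page(pdu) + req->offset, len, flags);
    	/* header digest, offset accounting and state transitions elided */
    	return ret;
    }

Note that send_flags 0x28040 in the traces below decodes to
MSG_DONTWAIT (0x40) | MSG_MORE (0x8000) | MSG_SENDPAGE_NOTLAST (0x20000),
i.e. every captured command PDU went out on the "more is coming" branch.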
>
> I don't know yet...
>
> > tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=28 op=0 data_len=16384
> > tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=27 op=0 data_len=16384
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=26 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=25 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=24 op=0 data_len=32768
> >
> >
> > [  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> > [  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>
> Here lock=1, which means that we send inline; however last=0, which
> means that more requests are coming and corresponds with the NOTLAST flag.
>
> > [  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> > [  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
>
> Here lock=0, but last=0 as well, so we add the request to the req_list
> and do _not_ trigger io_work, again because more requests are coming.
>
> > [  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>
> Here lock=0, last=1, and we have requests that are queued (more=1), so
> we add the request to the req_list and schedule io_work. Now sending the
> last 3 commands happens asynchronously, so by the time we get to send
> to the wire, it is possible that more requests got queued, which
> may explain why the NOTLAST flag is set...
>
> So we cannot know for sure... You can add what nvme_tcp_queue_more()
> returns at the time the flag is determined, but according to the code, it
> is set if more requests are queued in the req_list/send_list.
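
For reference, nvme_tcp_queue_more() here is roughly (again paraphrased
from the 5.19 driver):

    static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
    {
    	/* more work queued, or an inline send batch still in progress */
    	return !list_empty(&queue->send_list) ||
    		!llist_empty(&queue->req_list) || queue->more_requests;
    }

more_requests is only set around the inline nvme_tcp_send_all() window
shown earlier, so for deferred io_work sends the result depends purely on
the two lists.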
>
> At some point, we'd expect to send a request that has the NOTLAST
> flag cleared. It would be interesting to make sure that we are getting
> there.
>
> BTW, is this a read-only workload? Which filesystem? Can you share more
> about the workload?


