nvme-tcp: i/o errors and stalled host from failure to send command pdus

Sagi Grimberg sagi at grimberg.me
Thu Sep 1 10:00:25 PDT 2022


>>> Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
>>>
>>> req->dbg.send = 1;
>>> ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
>>> req->dbg.sendpage = ret;
>>>
>>>
>>> nvme-trace.out
>>>
>>> tar-1785    [005] .....   232.768892: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=18 op=0 data_len=20480
>>> tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=17 op=0 data_len=32768
>>> tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=14 op=0 data_len=32768
>>> kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=13 op=0 data_len=32768
>>> kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=12 op=0 data_len=12288
>>> tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
>>>    nvme1: qid=8 tag=69 op=0 data_len=20480
>>>
>>>
>>> dmesg.out (reordered to match nvme-trace.out)
>>>
>>> [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
>>
>> Something is not adding up...
>> lock=1 means that we got into nvme_tcp_send_all(), but somehow we were
>> not able to send it to the wire?
>>
>> I'd look into this and add an indication that it was pulled from the
>> req_list, and in turn from the send_list. It looks like it is somehow
>> picked out and silently dropped, because as long as it is on the list(s),
>> another context should have picked it up at some point...
>>
>>> [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
>>> [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
>>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
>>> [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 0 lock 0 last 0 more 1 send 1 sendpage 72
>>> [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 0 lock 0 last 1 more 1 send 1 sendpage 72
> 
> Ugh. It was misleading because the send state tracking was
> re-initialized after the direct send case. Here's a better trace with
> more information. It appears as though sendpage() is always called
> with the NOTLAST flag set. I suppose this could mean that there is
> some confusion or race with nvme_tcp_queue_more()?

I don't know yet...

> tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=28 op=0 data_len=16384
> tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=27 op=0 data_len=16384
> kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=26 op=0 data_len=32768
> kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=25 op=0 data_len=32768
> kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=24 op=0 data_len=32768
> 
> 
> [  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> [  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040

Here lock=1, which means that we sent inline; however last=0, which
means that more requests are coming, and that corresponds with the NOTLAST flag.
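
To spell out what I mean, here is a simplified sketch of
nvme_tcp_queue_request() from drivers/nvme/host/tcp.c (unrelated details
elided); as I read your debug fields, "empty" and "lock" map onto the
emptiness check and the send_mutex trylock below:

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	/*
	 * Always queue the request; llist_add() returns true if req_list
	 * was empty, so "empty" means the queue was otherwise idle.
	 */
	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * Send inline from the submitting context only if we are on the
	 * queue io_cpu, nothing else is pending, and we won the send_mutex.
	 */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	}

	/* only kick io_work for the last request of a batch, and only if
	 * something is still queued (see nvme_tcp_queue_more() below) */
	if (last && nvme_tcp_queue_more(queue))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}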

> [  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> [  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040

Here lock=0, but last=0 as well, so we add the request to the req_list and
do _not_ trigger io_work, again because more requests are coming.
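
The "not trigger io_work" part is the tail of the sketch above:
queue_work_on() is only called for the last request of a batch, and only
if nvme_tcp_queue_more() still sees something pending, which is roughly:

static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
{
	/* anything still queued for this queue, on either list? */
	return !list_empty(&queue->send_list) ||
		!llist_empty(&queue->req_list);
}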

> [  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040

Here lock=0, last=1, and we have requests that are queued (more=1), so
we add to the req_list and schedule io_work. Now sending the last
3 commands happens asynchronously, so by the time we get to send
to the wire, it is possible that more requests got queued, which
may explain why the NOTLAST flag is set...
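
The asynchronous part is io_work grabbing the send_mutex and calling
nvme_tcp_try_send(), which pulls requests via nvme_tcp_fetch_request();
anything that landed on the req_list in the meantime gets spliced over to
the send_list at that point (again a simplified sketch; nvme_tcp_send_all()
in the inline path goes through the same helper in a loop):

static struct nvme_tcp_request *
nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_request *req;

	req = list_first_entry_or_null(&queue->send_list,
			struct nvme_tcp_request, entry);
	if (!req) {
		/* splice whatever accumulated on req_list onto send_list */
		nvme_tcp_process_req_list(queue);
		req = list_first_entry_or_null(&queue->send_list,
				struct nvme_tcp_request, entry);
		if (unlikely(!req))
			return NULL;
	}

	list_del(&req->entry);
	return req;
}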

So we cannot know for sure... You can add what nvme_tcp_queue_more()
returns at the time the flag is determined, but according to the code, it
is set if more requests are queued in the req_list/send_list.
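
Concretely, the flag selection for the command PDU happens in
nvme_tcp_try_send_cmd_pdu(). Sketched below, simplified, with the value
hoisted into a local and stored into a hypothetical dbg.queue_more field
(in the style of your existing req->dbg instrumentation); both of those
are additions for illustration, not upstream code:

static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
{
	struct nvme_tcp_queue *queue = req->queue;
	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
	bool inline_data = nvme_tcp_has_inline_data(req);
	bool more = nvme_tcp_queue_more(queue);
	u8 hdgst = nvme_tcp_hdgst_len(queue);
	int len = sizeof(*pdu) + hdgst - req->offset;
	int flags = MSG_DONTWAIT;
	int ret;

	/* hypothetical debug field: what queue_more saw when the flag
	 * was chosen for this send */
	req->dbg.queue_more = more;

	if (inline_data || more)
		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
	else
		flags |= MSG_EOR;

	/* header digest handling elided */

	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
			offset_in_page(pdu) + req->offset, len, flags);
	...
}

FWIW, send_flags 0x28040 in your output decodes to
MSG_SENDPAGE_NOTLAST (0x20000) | MSG_MORE (0x8000) | MSG_DONTWAIT (0x40),
which is consistent with the first branch being taken every time.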

At some point, we'd expect a send from a request that has the NOTLAST
flag cleared. It would be interesting to make sure that we are getting
there.
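
One hypothetical way to verify that, building on the same instrumentation:
log whenever a command PDU actually goes out with MSG_EOR, e.g. right
before the kernel_sendpage() call above (nvme_tcp_queue_id() and
blk_mq_rq_from_pdu() are existing helpers):

	/* hypothetical check: shout when a cmd PDU finally goes out
	 * without NOTLAST, i.e. with MSG_EOR set */
	if (!(flags & MSG_SENDPAGE_NOTLAST))
		pr_info("nvme_tcp: qid=%d tag=%#x cmd pdu sent with MSG_EOR\n",
			nvme_tcp_queue_id(queue),
			blk_mq_rq_from_pdu(req)->tag);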

BTW, is this a read-only workload? Filesystem? Can you share more
about the workload?


