nvme-tcp: i/o errors and stalled host from failure to send command pdus

Jonathan Nicklin jnicklin at blockbridge.com
Thu Sep 1 08:37:39 PDT 2022


On Thu, Sep 1, 2022 at 9:22 AM Jonathan Nicklin
<jnicklin at blockbridge.com> wrote:
>
> On Thu, Sep 1, 2022 at 8:37 AM Jonathan Nicklin
> <jnicklin at blockbridge.com> wrote:
> >
> > On Wed, Aug 31, 2022 at 7:42 PM Sagi Grimberg <sagi at grimberg.me> wrote:
> > >
> > >
> > > >
> > > > Here's a trace including state captured in nvme_tcp_try_send_cmd_pdu(), like so:
> > > >
> > > > req->dbg.send = 1;
> > > > ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
> > > > req->dbg.sendpage = ret;
> > > >
> > > >
> > > > nvme-trace.out
> > > >
> > > > tar-1785    [005] .....   232.768892: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=18 op=0 data_len=20480
> > > > tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=17 op=0 data_len=32768
> > > > tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=14 op=0 data_len=32768
> > > > kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=13 op=0 data_len=32768
> > > > kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=12 op=0 data_len=12288
> > > > tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
> > > >   nvme1: qid=8 tag=69 op=0 data_len=20480
> > > >
> > > >
> > > > dmesg.out (reordered to match nvme-trace.out)
> > > >
> > > > [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > >
> > > Something is not adding up...
> > > lock=1 means that we get into nvme_tcp_send_all(), but somehow we are
> > > not able to send it to the wire?
> > >
> > > I'd look into this and add an indication that it is pulled from the
> > > req_list, and in turn from the send_list. It looks like it is somehow
> > > picked out and silently dropped, because as long as it is on the list(s)
> > > another context should have picked it up at some point...
> > >
> > > > [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > > > [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
> > > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > > > [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 0 lock 0 last 0 more 1 send 1 sendpage 72
> > > > [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 0 lock 0 last 1 more 1 send 1 sendpage 72
> >
> > Ugh. The previous trace was misleading because the send state tracking
> > was re-initialized after the direct-send case. Here's a better trace
> > with more information. It appears that sendpage() is always called
> > with the NOTLAST flag set. Could this mean there is some confusion or
> > a race with nvme_tcp_queue_more()?
> >
> > tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=28 op=0 data_len=16384
> > tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=27 op=0 data_len=16384
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=26 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=25 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=24 op=0 data_len=32768
> >
> >
> > [  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> > [  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> > [  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> > [  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> > [  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>
> Additional tracing suggests it could be a race or issue with
> queue->more_requests.
>
> [  675.762509] nvme nvme1: state: tag 0x5f io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 0 q_req_list 0 q_more_requests 1
> [  675.720866] nvme nvme1: state: tag 0x5e io_cpu 1 smp_id 1 smp_id2 5 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 0 q_req_list 0 q_more_requests 1
> [  675.679256] nvme nvme1: state: tag 0x5d io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 1 q_req_list 0 q_more_requests 1
> [  675.621054] nvme nvme1: state: tag 0x5c io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 1 q_req_list 0 q_more_requests 1
> [  675.593414] nvme nvme1: state: tag 0x5a io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 0 q_req_list 0 q_more_requests 1

Question: queue->more_requests is updated under the queue's send_mutex.
However, not all readers of queue->more_requests acquire that lock. The
smp_id/smp_id2 values show the same queue being processed on different
CPUs. Is this a race between the kworker (io_work) and inline
submission?



More information about the Linux-nvme mailing list