nvme-tcp: i/o errors and stalled host from failure to send command pdus
Jonathan Nicklin
jnicklin at blockbridge.com
Fri Sep 2 06:47:39 PDT 2022
On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
<jnicklin at blockbridge.com> wrote:
>
> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
> <jnicklin at blockbridge.com> wrote:
> >
> > nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
> > nvme_tcp_queue_request() is an example where queue->more_requests gets
> > read without the send_mutex? queue_work_on() is conditional on this.
> >
> > Apologies in advance if I'm misunderstanding the code.
> >
> > On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi at grimberg.me> wrote:
> > >
> > >
> > > > Question: queue->more_requests is updated under a lock. However, not
> > > > all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> > > > values seem to show that the same queue was being processed on
> > > > different CPUs. Is this a race between the kworker and inline
> > > > submission?
> > >
> > > The entire send path in io_work is wrapped with the send_mutex. Which
> > > part is not protected with the lock?
>
> Okay, maybe some progress?! What appears to be happening is that
> the request marked "last" is being picked up via nvme_tcp_send_all() when
> processing the previous command. Consider another trace:
>
> CPU
> tar-2263 [004] ..... 680.671184: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5e op=0 data_len=16384
> tar-2263 [004] ..... 680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5e op=0 page_offset=1352 send_len=72
> tar-2263 [004] ..... 680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5d op=0 data_len=32768
> tar-2263 [004] ..... 680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5d op=0 page_offset=1424 send_len=72
> tar-2263 [004] ..... 680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5c op=0 data_len=32768
> tar-2263 [004] ..... 680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5c op=0 page_offset=1496 send_len=72
> tar-2263 [004] .N... 680.671188: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5b op=0 data_len=32768
> tar-2263 [004] .N... 680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5b op=0 page_offset=1568 send_len=72
> kworker/4:0H-38 [004] ..... 680.671198: nvme_tcp_queue_rq: nvme1:
> qid=5 tag=5a op=0 data_len=16384
> tar-2263 [000] ..... 680.671199: nvme_tcp_cmd_pdu: nvme1:
> qid=5 tag=5a op=0 page_offset=1640 send_len=72
>
> I've added some additional state to the request to print out on
> timeout. First, queue->more_requests now tracks the command tag
> that asserted more_requests. Second, I recorded the number of commands
> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
> issued, it ended up sending 0x5a as well. more_requests, asserted by
> 0x5b, was still set when nvme_tcp_try_send_cmd_pdu() processed 0x5a,
> so we never actually pushed it out onto the wire.
>
> [ 709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
> q_req_list 0 q_more_requests 0x5e req_sent 1
> [ 709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
> q_req_list 0 q_more_requests 0x5d req_sent 1
> [ 709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
> q_req_list 0 q_more_requests 0x5c req_sent 1
> [ 709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
> q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
> [ 709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
> empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
> q_req_list 0 q_more_requests 0x5b !!! req_sent 0
I'm not quite sure of all of the mechanics of command submission, but it
looks like if we are preempted in nvme_tcp_queue_request(), subsequent
commands pile up on the req_list. When rescheduled, we drain the req_list
with more_requests still set, which prevents us from pushing the bits onto
the wire. Note the "N" (need-resched) flag in the traces, and the fact that
the starting CPU "smp_id" differs from the ending CPU "smp_id2" for the
associated tag.
More information about the Linux-nvme mailing list