nvme-tcp: i/o errors and stalled host from failure to send command pdus

Mon Sep 5 07:54:06 PDT 2022

On Mon, Sep 5, 2022 at 4:50 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
>
> On 9/2/22 16:47, Jonathan Nicklin wrote:
> > On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
> > <jnicklin at blockbridge.com> wrote:
> >>
> >> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
> >> <jnicklin at blockbridge.com> wrote:
> >>>
> >>> nvme_tcp_queue_more() reads queue->mode_requests. So, I supposed
> >>> nvme_tcp_queue_request() is an example where queue->more_requests gets
> >>> read without the send_mutex? queue_work_on() is conditionalized on this.
> >>>
> >>> Apologies in advance if I'm misunderstanding the code.
> >>>
> >>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi at grimberg.me> wrote:
> >>>>
> >>>>
> >>>>> Question: queue->more_requests is updated under a lock. However, not
> >>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> >>>>> values seem to show that the same queue was being processed on
> >>>>> different CPUs. Is this a race between the kworker and inline
> >>>>> submission?
> >>>>
> >>>> The entire send path in io_work is wrapped with the send_mutex. Which
> >>>> part is not protected with the lock?
> >>
> >> Okay, maybe some progress?! What appears to be happening is that
> >> the request marked "last" is being picked up via nvme_tcp_send_all() when
> >> processing the previous command. Consider another trace:
> >>
> >>                   CPU
> >> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5e op=0 data_len=16384
> >> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5e op=0 page_offset=1352 send_len=72
> >> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5d op=0 data_len=32768
> >> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5d op=0 page_offset=1424 send_len=72
> >> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5c op=0 data_len=32768
> >> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5c op=0 page_offset=1496 send_len=72
> >> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5b op=0 data_len=32768
> >> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5b op=0 page_offset=1568 send_len=72
> >> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
> >> qid=5 tag=5a op=0 data_len=16384
> >> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
> >> qid=5 tag=5a op=0 page_offset=1640 send_len=72
> >>
> >> I've added some additional state to the request to print out on
> >> timeout. First, queue->more_requests now tracks the command tag
> >> that asserted more_requests. Second, I recorded the number of commands
> >> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
> >> issued, it ended up sending 0x5a as well. More_requests was asserted
> >> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
> >> actually pushed it out onto the wire.
> >>
> >> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
>
> What is the difference between smp_id and smp_id2? where is each
> taken?

smp_id2 is taken immediately after the direct send case (after unlock). It just
confirms that we were preempted and rescheduled.

smp_id = raw_smp_processor_id();
if (queue->io_cpu == smp_id && sync && empty) {
        lock = mutex_trylock(&queue->send_mutex);
        if (lock) {
            nvme_tcp_send_all(queue, &sent);
            mutex_unlock(&queue->send_mutex);
        }
}
smp_id2 = raw_smp_processor_id();

>
> >>   q_req_list 0 q_more_requests 0x5e req_sent 1
> >> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5d req_sent 1
> >> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5c req_sent 1
> >> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
> >> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5b !!! req_sent 0
> >
> > I'm not quite sure of all of the mechanics of command submission... but it
> > looks like if we are preempted in nvme_tcp_queue_request(), subsequent commands
> > pile up on the req_list. When rescheduled, we drain the req_list with
> > more_reqests set, which prevents us from pushing the bits onto the wire.
> >
> > Take notice of the "N" flag in the traces and the fact that the
> > starting cpu "smp_id" is different from the ending cpu "smp_id2" for
> > the associated tag.
>
> According to the trace, 0x5a came after 0x5b, and 0x5a was last=1, which
> means it should have cleared NOLAST and push it all down the wire,
> however it had it set.
>
> What happens here is that 0x5b execution sets queue->more_requests=true
> for the lifetime of its execution and that is wrong.
>
> To prove the theory can you try this patch?
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3e3ebde4eff5..644def671df4 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -121,7 +121,6 @@ struct nvme_tcp_queue {
>          struct mutex            send_mutex;
>          struct llist_head       req_list;
>          struct list_head        send_list;
> -       bool                    more_requests;
>
>          /* recv state */
>          void                    *pdu;
> @@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct
> nvme_tcp_queue *queue)
>   static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
>   {
>          return !list_empty(&queue->send_list) ||
> -               !llist_empty(&queue->req_list) || queue->more_requests;
> +               !llist_empty(&queue->req_list);
>   }
>
>   static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> @@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct
> nvme_tcp_request *req,
>           */
>          if (queue->io_cpu == raw_smp_processor_id() &&
>              sync && empty && mutex_trylock(&queue->send_mutex)) {
> -               queue->more_requests = !last;
>                  nvme_tcp_send_all(queue);
> -               queue->more_requests = false;
>                  mutex_unlock(&queue->send_mutex);
>          }
> --
>
> The downside here, is that commands now that are queued in a batch,
> will not have NOLAST set, which means that they will be sent directly
> to the wire...
>
> I'll need to think about the correct way to optimize for this case,
> however there is an argument that they might as well just be sent
> asap (i.e. give up batching).
>
> Alternatively, perhaps remember this in the lifetime of the command
> however this needs careful attention because network sends are also
> driven from r2t sent by the controller...

I can confirm that ignoring queue->more_requests resolves the issue. I've run
about 300 iterations without issue. Pre-fix, the problem reproduced almost
immediately (1-3 runs).