nvme-tcp: i/o errors and stalled host from failure to send command pdus
Jonathan Nicklin
jnicklin at blockbridge.com
Tue Sep 6 06:36:49 PDT 2022
Not a problem. Is it possible to get this fix committed back into the last few
longterm kernels? It seems like anyone running nvme/tcp could benefit from
this fix.
On Mon, Sep 5, 2022 at 11:06 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
>
> On 9/5/22 17:54, Jonathan Nicklin wrote:
> > On Mon, Sep 5, 2022 at 4:50 AM Sagi Grimberg <sagi at grimberg.me> wrote:
> >>
> >>
> >>
> >> On 9/2/22 16:47, Jonathan Nicklin wrote:
> >>> On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
> >>> <jnicklin at blockbridge.com> wrote:
> >>>>
> >>>> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
> >>>> <jnicklin at blockbridge.com> wrote:
> >>>>>
> >>>>> nvme_tcp_queue_more() reads queue->mode_requests. So, I supposed
> >>>>> nvme_tcp_queue_request() is an example where queue->more_requests gets
> >>>>> read without the send_mutex? queue_work_on() is conditionalized on this.
> >>>>>
> >>>>> Apologies in advance if I'm misunderstanding the code.
> >>>>>
> >>>>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi at grimberg.me> wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Question: queue->more_requests is updated under a lock. However, not
> >>>>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> >>>>>>> values seem to show that the same queue was being processed on
> >>>>>>> different CPUs. Is this a race between the kworker and inline
> >>>>>>> submission?
> >>>>>>
> >>>>>> The entire send path in io_work is wrapped with the send_mutex. Which
> >>>>>> part is not protected with the lock?
> >>>>
> >>>> Okay, maybe some progress?! What appears to be happening is that
> >>>> the request marked "last" is being picked up via nvme_tcp_send_all() when
> >>>> processing the previous command. Consider another trace:
> >>>>
> >>>> CPU
> >>>> tar-2263 [004] ..... 680.671184: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5e op=0 data_len=16384
> >>>> tar-2263 [004] ..... 680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5e op=0 page_offset=1352 send_len=72
> >>>> tar-2263 [004] ..... 680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5d op=0 data_len=32768
> >>>> tar-2263 [004] ..... 680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5d op=0 page_offset=1424 send_len=72
> >>>> tar-2263 [004] ..... 680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5c op=0 data_len=32768
> >>>> tar-2263 [004] ..... 680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5c op=0 page_offset=1496 send_len=72
> >>>> tar-2263 [004] .N... 680.671188: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5b op=0 data_len=32768
> >>>> tar-2263 [004] .N... 680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5b op=0 page_offset=1568 send_len=72
> >>>> kworker/4:0H-38 [004] ..... 680.671198: nvme_tcp_queue_rq: nvme1:
> >>>> qid=5 tag=5a op=0 data_len=16384
> >>>> tar-2263 [000] ..... 680.671199: nvme_tcp_cmd_pdu: nvme1:
> >>>> qid=5 tag=5a op=0 page_offset=1640 send_len=72
> >>>>
> >>>> I've added some additional state to the request to print out on
> >>>> timeout. First, queue->more_requests now tracks the command tag
> >>>> that asserted more_requests. Second, I recorded the number of commands
> >>>> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
> >>>> issued, it ended up sending 0x5a as well. More_requests was asserted
> >>>> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
> >>>> actually pushed it out onto the wire.
> >>>>
> >>>> [ 709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>
> >> What is the difference between smp_id and smp_id2? where is each
> >> taken?
> >
> > smp_id2 is taken immediately after the direct send case (after unlock). It just
> > confirms that we were preempted and rescheduled.
> >
> > smp_id = raw_smp_processor_id();
> > if (queue->io_cpu == smp_id && sync && empty) {
> > lock = mutex_trylock(&queue->send_mutex);
> > if (lock) {
> > nvme_tcp_send_all(queue, &sent);
> > mutex_unlock(&queue->send_mutex);
> > }
> > }
> > smp_id2 = raw_smp_processor_id();
>
> Got it.
>
> >
> >>
> >>>> q_req_list 0 q_more_requests 0x5e req_sent 1
> >>>> [ 709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>> q_req_list 0 q_more_requests 0x5d req_sent 1
> >>>> [ 709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>> q_req_list 0 q_more_requests 0x5c req_sent 1
> >>>> [ 709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >>>> empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>> q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
> >>>> [ 709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>> empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>> q_req_list 0 q_more_requests 0x5b !!! req_sent 0
> >>>
> >>> I'm not quite sure of all of the mechanics of command submission... but it
> >>> looks like if we are preempted in nvme_tcp_queue_request(), subsequent commands
> >>> pile up on the req_list. When rescheduled, we drain the req_list with
> >>> more_reqests set, which prevents us from pushing the bits onto the wire.
> >>>
> >>> Take notice of the "N" flag in the traces and the fact that the
> >>> starting cpu "smp_id" is different from the ending cpu "smp_id2" for
> >>> the associated tag.
> >>
> >> According to the trace, 0x5a came after 0x5b, and 0x5a was last=1, which
> >> means it should have cleared NOLAST and push it all down the wire,
> >> however it had it set.
> >>
> >> What happens here is that 0x5b execution sets queue->more_requests=true
> >> for the lifetime of its execution and that is wrong.
> >>
> >> To prove the theory can you try this patch?
> >> --
> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >> index 3e3ebde4eff5..644def671df4 100644
> >> --- a/drivers/nvme/host/tcp.c
> >> +++ b/drivers/nvme/host/tcp.c
> >> @@ -121,7 +121,6 @@ struct nvme_tcp_queue {
> >> struct mutex send_mutex;
> >> struct llist_head req_list;
> >> struct list_head send_list;
> >> - bool more_requests;
> >>
> >> /* recv state */
> >> void *pdu;
> >> @@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct
> >> nvme_tcp_queue *queue)
> >> static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
> >> {
> >> return !list_empty(&queue->send_list) ||
> >> - !llist_empty(&queue->req_list) || queue->more_requests;
> >> + !llist_empty(&queue->req_list);
> >> }
> >>
> >> static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> >> @@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct
> >> nvme_tcp_request *req,
> >> */
> >> if (queue->io_cpu == raw_smp_processor_id() &&
> >> sync && empty && mutex_trylock(&queue->send_mutex)) {
> >> - queue->more_requests = !last;
> >> nvme_tcp_send_all(queue);
> >> - queue->more_requests = false;
> >> mutex_unlock(&queue->send_mutex);
> >> }
> >> --
> >>
> >> The downside here, is that commands now that are queued in a batch,
> >> will not have NOLAST set, which means that they will be sent directly
> >> to the wire...
> >>
> >> I'll need to think about the correct way to optimize for this case,
> >> however there is an argument that they might as well just be sent
> >> asap (i.e. give up batching).
> >>
> >> Alternatively, perhaps remember this in the lifetime of the command
> >> however this needs careful attention because network sends are also
> >> driven from r2t sent by the controller...
> >
> > I can confirm that ignoring queue->more_requests resolves the issue. I've run
> > about 300 iterations without issue. Pre-fix, the problem reproduced almost
> > immediately (1-3 runs).
>
> Thanks a lot Jonathan for reporting this issue and helping to resolve
> it. Appreciate the effort.
>
> I'll send a patch, can reply with your Tested-by tag?
More information about the Linux-nvme
mailing list