nvme-tcp: i/o errors and stalled host from failure to send command pdus

Sagi Grimberg sagi at grimberg.me
Mon Sep 5 01:49:59 PDT 2022



On 9/2/22 16:47, Jonathan Nicklin wrote:
> On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
> <jnicklin at blockbridge.com> wrote:
>>
>> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
>> <jnicklin at blockbridge.com> wrote:
>>>
>>> nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
>>> nvme_tcp_queue_request() is an example where queue->more_requests gets
>>> read without the send_mutex? queue_work_on() is conditional on this.
>>>
>>> Apologies in advance if I'm misunderstanding the code.
>>>
>>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>>>>
>>>>
>>>>> Question: queue->more_requests is updated under a lock. However, not
>>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
>>>>> values seem to show that the same queue was being processed on
>>>>> different CPUs. Is this a race between the kworker and inline
>>>>> submission?
>>>>
>>>> The entire send path in io_work is wrapped with the send_mutex. Which
>>>> part is not protected with the lock?
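
For context, the path in question looks roughly like this (a paraphrased
sketch of drivers/nvme/host/tcp.c as of this discussion, not a verbatim
excerpt):
--
static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
{
        /* reads queue->more_requests; not every caller holds send_mutex */
        return !list_empty(&queue->send_list) ||
                !llist_empty(&queue->req_list) || queue->more_requests;
}

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                bool sync, bool last)
{
        struct nvme_tcp_queue *queue = req->queue;
        bool empty;

        empty = llist_add(&req->lentry, &queue->req_list) &&
                list_empty(&queue->send_list) && !queue->request;

        /* inline send: more_requests is written under send_mutex here... */
        if (queue->io_cpu == raw_smp_processor_id() &&
            sync && empty && mutex_trylock(&queue->send_mutex)) {
                queue->more_requests = !last;
                nvme_tcp_send_all(queue);
                queue->more_requests = false;
                mutex_unlock(&queue->send_mutex);
        }

        /* ...but this check runs without the lock, which is the unlocked
         * read referred to above */
        if (last && nvme_tcp_queue_more(queue))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
--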
>>
>> Okay, maybe some progress?! What appears to be happening is that
>> the request marked "last" is being picked up via nvme_tcp_send_all() when
>> processing the previous command. Consider another trace:
>>
>>                   CPU
>> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5e op=0 data_len=16384
>> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5e op=0 page_offset=1352 send_len=72
>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5d op=0 data_len=32768
>> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5d op=0 page_offset=1424 send_len=72
>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5c op=0 data_len=32768
>> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5c op=0 page_offset=1496 send_len=72
>> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5b op=0 data_len=32768
>> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5b op=0 page_offset=1568 send_len=72
>> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
>> qid=5 tag=5a op=0 data_len=16384
>> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
>> qid=5 tag=5a op=0 page_offset=1640 send_len=72
>>
>> I've added some additional state to the request to print out on
>> timeout. First, queue->more_requests now tracks the command tag
>> that asserted more_requests. Second, I recorded the number of commands
>> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
>> issued, it ended up sending 0x5a as well. more_requests was still
>> asserted by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so
>> 0x5a was never actually pushed out onto the wire.
>>
>> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0

What is the difference between smp_id and smp_id2? Where is each
taken?

>>   q_req_list 0 q_more_requests 0x5e req_sent 1
>> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5d req_sent 1
>> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5c req_sent 1
>> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
>> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5b !!! req_sent 0
> 
> I'm not quite sure of all of the mechanics of command submission... but it
> looks like if we are preempted in nvme_tcp_queue_request(), subsequent commands
> pile up on the req_list. When rescheduled, we drain the req_list with
> more_requests set, which prevents us from pushing the bits onto the wire.
> 
> Take note of the "N" (need-resched) flag in the traces and the fact
> that the starting CPU "smp_id" is different from the ending CPU
> "smp_id2" for the associated tag.
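
A side note on those state dumps: if I decode the socket flags right,
send_flags 0x28040 is MSG_DONTWAIT (0x40) | MSG_MORE (0x8000) |
MSG_SENDPAGE_NOTLAST (0x20000), i.e. every one of these cmd PDUs,
including 0x5a, went out with MORE/NOTLAST set. The flags are chosen in
nvme_tcp_try_send_cmd_pdu() along these lines (sketch, not a verbatim
excerpt):
--
        bool inline_data = nvme_tcp_has_inline_data(req);
        int flags = MSG_DONTWAIT;

        /* if inline data follows, or the queue claims more work is
         * pending, hint the stack not to push this PDU out yet */
        if (inline_data || nvme_tcp_queue_more(queue))
                flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
        else
                flags |= MSG_EOR;
--
So for 0x5a the MORE/NOTLAST decision appears to have come from
nvme_tcp_queue_more() seeing more_requests still asserted by 0x5b, even
though 0x5a itself was queued with last=1.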

According to the trace, 0x5a came after 0x5b, and 0x5a had last=1, which
means it should have cleared NOLAST and pushed everything down the wire;
however, NOLAST was still set.

What happens here is that the inline submission of 0x5b sets
queue->more_requests=true for the lifetime of its nvme_tcp_send_all()
call, and that is wrong: 0x5a, which was picked up and sent within that
window, inherited the flag even though it was last=1.

To prove the theory, can you try this patch?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3e3ebde4eff5..644def671df4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -121,7 +121,6 @@ struct nvme_tcp_queue {
         struct mutex            send_mutex;
         struct llist_head       req_list;
         struct list_head        send_list;
-       bool                    more_requests;

         /* recv state */
         void                    *pdu;
@@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
  static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
  {
         return !list_empty(&queue->send_list) ||
-               !llist_empty(&queue->req_list) || queue->more_requests;
+               !llist_empty(&queue->req_list);
  }

  static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
@@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
          */
         if (queue->io_cpu == raw_smp_processor_id() &&
             sync && empty && mutex_trylock(&queue->send_mutex)) {
-               queue->more_requests = !last;
                 nvme_tcp_send_all(queue);
-               queue->more_requests = false;
                 mutex_unlock(&queue->send_mutex);
         }
--

The downside here is that commands queued in a batch will no longer
have NOLAST set, which means they will be sent directly to the
wire...

I'll need to think about the correct way to optimize for this case;
however, there is an argument that they might as well just be sent
ASAP (i.e. give up batching).

Alternatively, perhaps remember this for the lifetime of the command;
however, this needs careful attention because network sends are also
driven by R2Ts sent by the controller...
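
Something along these lines, perhaps (an untested sketch only; req->last
is a hypothetical new field, and the R2T-driven H2CData sends are exactly
where this would need more thought):
--
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                bool sync, bool last)
{
        struct nvme_tcp_queue *queue = req->queue;

        req->last = last;       /* keep the batching hint with the request */
        /* ... rest as today, minus the more_requests toggling ... */
}

/* and in the send path, e.g. nvme_tcp_try_send_cmd_pdu(): */
        if (inline_data || !req->last || nvme_tcp_queue_more(queue))
                flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
        else
                flags |= MSG_EOR;
        /* caveat: when the send is triggered later by an R2T from the
         * controller, req->last reflects the original submission batch
         * and may be a stale hint by then */
--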


