nvme-tcp: i/o errors and stalled host from failure to send command pdus

Sagi Grimberg sagi at grimberg.me
Mon Sep 5 08:06:30 PDT 2022



On 9/5/22 17:54, Jonathan Nicklin wrote:
> On Mon, Sep 5, 2022 at 4:50 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>>
>>
>>
>> On 9/2/22 16:47, Jonathan Nicklin wrote:
>>> On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
>>> <jnicklin at blockbridge.com> wrote:
>>>>
>>>> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
>>>> <jnicklin at blockbridge.com> wrote:
>>>>>
>>>>> nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
>>>>> nvme_tcp_queue_request() is an example where queue->more_requests gets
>>>>> read without the send_mutex? queue_work_on() is conditional on it.
>>>>>
>>>>> Apologies in advance if I'm misunderstanding the code.
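
For reference, the unlocked read in question is the queue_work_on() tail of
nvme_tcp_queue_request(). Roughly, paraphrased from drivers/nvme/host/tcp.c
(trimmed, not verbatim):
--
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/* direct send: more_requests is set/cleared under send_mutex */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		queue->more_requests = !last;
		nvme_tcp_send_all(queue);
		queue->more_requests = false;
		mutex_unlock(&queue->send_mutex);
	}

	/* unlocked read: nvme_tcp_queue_more() peeks at more_requests here */
	if (last && nvme_tcp_queue_more(queue))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
--
That read only decides whether io_work needs to be scheduled; the sends
themselves still happen under send_mutex.
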
>>>>>
>>>>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>>>>>>
>>>>>>
>>>>>>> Question: queue->more_requests is updated under a lock. However, not
>>>>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
>>>>>>> values seem to show that the same queue was being processed on
>>>>>>> different CPUs. Is this a race between the kworker and inline
>>>>>>> submission?
>>>>>>
>>>>>> The entire send path in io_work is wrapped with the send_mutex. Which
>>>>>> part is not protected with the lock?
>>>>
>>>> Okay, maybe some progress?! What appears to be happening is that
>>>> the request marked "last" is being picked up via nvme_tcp_send_all() when
>>>> processing the previous command. Consider another trace:
>>>>
>>>>                    CPU
>>>> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5 tag=5e op=0 data_len=16384
>>>> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5 tag=5e op=0 page_offset=1352 send_len=72
>>>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5 tag=5d op=0 data_len=32768
>>>> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5 tag=5d op=0 page_offset=1424 send_len=72
>>>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5 tag=5c op=0 data_len=32768
>>>> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5 tag=5c op=0 page_offset=1496 send_len=72
>>>> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5 tag=5b op=0 data_len=32768
>>>> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5 tag=5b op=0 page_offset=1568 send_len=72
>>>> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1: qid=5 tag=5a op=0 data_len=16384
>>>> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1: qid=5 tag=5a op=0 page_offset=1640 send_len=72
>>>>
>>>> I've added some additional state to the request to print out on
>>>> timeout. First, queue->more_requests now tracks the command tag
>>>> that asserted more_requests. Second, I recorded the number of commands
>>>> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
>>>> issued, it ended up sending 0x5a as well. More_requests was asserted
>>>> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
>>>> actually pushed it out onto the wire.
>>>>
>>>> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040 q_send_list 0
>>
>> What is the difference between smp_id and smp_id2? Where is each
>> taken?
> 
> smp_id2 is taken immediately after the direct send case (after unlock). It just
> confirms that we were preempted and rescheduled.
> 
> smp_id = raw_smp_processor_id();
> if (queue->io_cpu == smp_id && sync && empty) {
>          lock = mutex_trylock(&queue->send_mutex);
>          if (lock) {
>              nvme_tcp_send_all(queue, &sent);
>              mutex_unlock(&queue->send_mutex);
>          }
> }
> smp_id2 = raw_smp_processor_id();

Got it.
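
For context on how one submitter ends up sending another submitter's request:
the direct-send path drains whatever is queued at that moment, not just the
request that was handed in. A rough paraphrase of the driver (not verbatim):
--
static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
{
	int ret;

	/* drain the send list as much as we can */
	do {
		ret = nvme_tcp_try_send(queue);
	} while (ret > 0);
}
--
nvme_tcp_try_send() fetches from send_list (refilled from req_list), so a
request queued from another context while the original submitter was
preempted (0x5a here, queued from the kworker) gets sent under the preempted
submitter's more_requests value.
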

> 
>>
>>>>    q_req_list 0 q_more_requests 0x5e req_sent 1
>>>> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040 q_send_list 0 q_req_list 0 q_more_requests 0x5d req_sent 1
>>>> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040 q_send_list 0 q_req_list 0 q_more_requests 0x5c req_sent 1
>>>> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040 q_send_list 0 q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
>>>> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1 empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040 q_send_list 0 q_req_list 0 q_more_requests 0x5b !!! req_sent 0
>>>
>>> I'm not quite sure of all of the mechanics of command submission... but it
>>> looks like if we are preempted in nvme_tcp_queue_request(), subsequent
>>> commands pile up on the req_list. When rescheduled, we drain the req_list
>>> with more_requests set, which prevents us from pushing the bits onto the wire.
>>>
>>> Take notice of the "N" flag in the traces and the fact that the
>>> starting cpu "smp_id" is different from the ending cpu "smp_id2" for
>>> the associated tag.
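
For readers decoding the state dumps: send_flags 0x28040 is
MSG_DONTWAIT (0x40) | MSG_MORE (0x8000) | MSG_SENDPAGE_NOTLAST (0x20000),
i.e. every one of these command PDUs went out marked "more data coming".
The flags for a command PDU are chosen from nvme_tcp_queue_more(); a rough
sketch of nvme_tcp_try_send_cmd_pdu() (trimmed, not verbatim):
--
static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
{
	struct nvme_tcp_queue *queue = req->queue;
	int flags = MSG_DONTWAIT;

	if (nvme_tcp_has_inline_data(req) || nvme_tcp_queue_more(queue))
		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
	else
		flags |= MSG_EOR;

	/* ... kernel_sendpage() of the PDU with these flags ... */
}
--
So if queue->more_requests is still true when the final request of the batch
(0x5a) is processed, its PDU is sent with NOTLAST and nothing afterwards
flushes it.
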
>>
>> According to the trace, 0x5a came after 0x5b, and 0x5a had last=1, which
>> means it should have cleared NOLAST and pushed everything down the wire;
>> however, NOLAST was still set.
>>
>> What happens here is that 0x5b sets queue->more_requests=true for the
>> lifetime of its execution, and that is wrong.
>>
>> To prove the theory can you try this patch?
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 3e3ebde4eff5..644def671df4 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -121,7 +121,6 @@ struct nvme_tcp_queue {
>>           struct mutex            send_mutex;
>>           struct llist_head       req_list;
>>           struct list_head        send_list;
>> -       bool                    more_requests;
>>
>>           /* recv state */
>>           void                    *pdu;
>> @@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>    static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
>>    {
>>           return !list_empty(&queue->send_list) ||
>> -               !llist_empty(&queue->req_list) || queue->more_requests;
>> +               !llist_empty(&queue->req_list);
>>    }
>>
>>    static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>> @@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>            */
>>           if (queue->io_cpu == raw_smp_processor_id() &&
>>               sync && empty && mutex_trylock(&queue->send_mutex)) {
>> -               queue->more_requests = !last;
>>                   nvme_tcp_send_all(queue);
>> -               queue->more_requests = false;
>>                   mutex_unlock(&queue->send_mutex);
>>           }
>> --
>>
>> The downside here is that commands that are queued in a batch will now
>> not have NOLAST set, which means that they will be sent directly to
>> the wire...
>>
>> I'll need to think about the correct way to optimize for this case;
>> however, there is an argument that they might as well just be sent
>> asap (i.e. give up batching).
>>
>> Alternatively, perhaps remember this for the lifetime of the command;
>> however, this needs careful attention because network sends are also
>> driven by R2Ts sent by the controller...
> 
> I can confirm that ignoring queue->more_requests resolves the issue. I've run
> about 300 iterations without a failure. Pre-fix, the problem reproduced almost
> immediately (within 1-3 runs).

Thanks a lot Jonathan for reporting this issue and helping to resolve
it. Appreciate the effort.

I'll send a patch; can you reply with your Tested-by tag?


