[PATCH 1/3] nvme-tcp: spurious I/O timeout under high load

Hannes Reinecke hare at suse.de
Tue May 24 02:34:37 PDT 2022


On 5/24/22 10:53, Sagi Grimberg wrote:
> 
>>>>>>>> I'm open to discussing what we should be doing when the request
>>>>>>>> is in the process of being sent. But when it didn't have a
>>>>>>>> chance to be sent and we just overloaded our internal queuing,
>>>>>>>> we shouldn't be sending timeouts.
>>>>>>>
>>>>>>> As mentioned above, what happens if that same reporter opens 
>>>>>>> another bug
>>>>>>> that the same phenomenon happens with soft-iwarp? What would you 
>>>>>>> tell
>>>>>>> him/her?
>>>>>>
>>>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>>>
>>>>> It was just a theoretical question.
>>>>>
>>>>> Do note that I'm not against solving a problem for anyone; I'm just
>>>>> questioning whether making the io_timeout effectively unbounded when
>>>>> the network is congested is the right solution for everyone, rather
>>>>> than a particular case that can easily be solved with a udev rule
>>>>> setting the io_timeout as high as needed.
>>>>>
>>>>> One can argue that this patchset is making nvme-tcp basically
>>>>> ignore the device io_timeout in certain cases.
>>>>
>>>> Oh, yes, sure, that will happen.
>>>> What I'm actually arguing about is the imprecise distinction between
>>>> returning BLK_STS_AGAIN / BLK_STS_RESOURCE from ->queue_rq() and
>>>> letting the command time out when the driver implementing ->queue_rq()
>>>> hits a resource constraint.
>>>>
>>>> If there is a resource constraint the driver is free to return
>>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or to
>>>> accept the request (in which case there will be a timeout).
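
To make the distinction concrete, here is a minimal sketch (a
hypothetical driver, not nvme-tcp; demo_queue, demo_queue_full() and
demo_enqueue_internal() are made-up names) of the two options a
->queue_rq() implementation has when it runs short of resources:

#include <linux/blk-mq.h>

static blk_status_t demo_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct demo_queue *dq = hctx->driver_data;
	struct request *rq = bd->rq;

	/*
	 * Option 1: report the constraint.  The block layer will requeue
	 * the request later and no io_timeout timer runs against it.
	 */
	if (demo_queue_full(dq))
		return BLK_STS_RESOURCE;

	/*
	 * Option 2: accept the request.  blk_mq_start_request() arms the
	 * io_timeout timer, so if the request then sits in an internal
	 * send queue for longer than io_timeout, ->timeout() fires even
	 * though nothing ever went out on the wire.
	 */
	blk_mq_start_request(rq);
	demo_enqueue_internal(dq, rq);
	return BLK_STS_OK;
}

Only in the second case is the timer running against a request that may
never have reached the wire, which is exactly the situation this series
is about.
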
>>>
>>> There is no resource constraint. The driver sizes up the resources
>>> to be able to queue all the requests it is getting.
>>>
>>>> I could live with a timeout if that would just result in the command 
>>>> being retried. But in the case of nvme it results in a connection 
>>>> reset to boot, making customers really nervous that their system is 
>>>> broken.
>>>
>>> But how does the driver know that it is running in an environment that
>>> is completely congested? What I'm saying is that this is a specific use
>>> case, and the solution can have negative side-effects for other common
>>> use-cases, because it is beyond the scope of the driver to handle.
>>>
>>> We can also trigger this condition with nvme-rdma.
>>>
>>> We could stay with this patch, but I'd argue that this might be the
>>> wrong thing to do in certain use-cases.
>>>
>> Right, okay.
>>
>> Arguably this is a workload corner case, and we might not want to fix 
>> this in the driver.
>>
>> _However_: do we need to do a controller reset in this case?
>> Shouldn't it be sufficient to just complete the command w/ timeout 
>> error and be done with it?
> 
> The question is what is special about this timeout vs. any other
> timeout?
> 
> pci attempts to abort the command before triggering a controller
> reset. Maybe we should too? Although abort is not really reliable,
> as it goes over the admin queue...

I am not talking about NVMe abort.
I'm talking about this:

@@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
                 "queue %d: timeout request %#x type %d\n",
                 nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);

+       if (!list_empty(&req->entry)) {
+               nvme_tcp_complete_timed_out(rq);
+               return BLK_EH_DONE;
+       }
+
         if (ctrl->state != NVME_CTRL_LIVE) {
                 /*
                  * If we are resetting, connecting or deleting we should


as the command is still sitting in the driver's internal queue, and NVMe
abort doesn't enter the picture at all.
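
To make the intent explicit, here is a sketch of how that check fits
into the timeout handler (a simplification on my side: the existing
NVME_CTRL_LIVE check shown in the context above is left out, and I'm
assuming req->entry is only unlinked once io_work has picked the
request up for transmission):

static enum blk_eh_timer_return nvme_tcp_timeout_sketch(struct request *rq)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

	if (!list_empty(&req->entry)) {
		/*
		 * Still linked on the driver's internal send list: nothing
		 * was put on the wire and the controller never saw the
		 * command, so neither NVMe abort nor a controller reset
		 * has anything to act on.  Complete the request with a
		 * timeout error and let the upper layers retry it.
		 */
		nvme_tcp_complete_timed_out(rq);
		return BLK_EH_DONE;
	}

	/*
	 * Otherwise the command may be partially or fully on the wire;
	 * keep the existing behaviour of tearing down the connection via
	 * error recovery.
	 */
	nvme_tcp_error_recovery(&req->queue->ctrl->ctrl);
	return BLK_EH_RESET_TIMER;
}

The second branch just mirrors what nvme_tcp_timeout() already does for
a live controller; only the first branch is new.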

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
