[PATCH 1/3] nvme-tcp: spurious I/O timeout under high load

Sagi Grimberg sagi at grimberg.me
Tue May 24 02:58:49 PDT 2022



On 5/24/22 12:34, Hannes Reinecke wrote:
> On 5/24/22 10:53, Sagi Grimberg wrote:
>>
>>>>>>>>> I'm open to discussing what we should be doing when the request 
>>>>>>>>> is in the process of being sent. But when it never had a chance 
>>>>>>>>> to be sent and we just overloaded our internal queuing, we 
>>>>>>>>> shouldn't be raising timeouts.
>>>>>>>>
>>>>>>>> As mentioned above, what happens if that same reporter opens 
>>>>>>>> another bug saying that the same phenomenon happens with 
>>>>>>>> soft-iwarp? What would you tell him/her?
>>>>>>>
>>>>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>>>>
>>>>>> It was just a theoretical question.
>>>>>>
>>>>>> Do note that I'm not against solving a problem for anyone; I'm just
>>>>>> questioning whether making the io_timeout effectively unbounded when
>>>>>> the network is congested is the right solution for everyone, rather
>>>>>> than a particular case that can easily be solved with a udev rule
>>>>>> setting io_timeout as high as needed.
>>>>>>
>>>>>> One can argue that this patchset makes nvme-tcp basically
>>>>>> ignore the device io_timeout in certain cases.
>>>>>
>>>>> Oh, yes, sure, that will happen.
>>>>> What I'm actually arguing about is the imprecise distinction between 
>>>>> returning BLK_STS_AGAIN / BLK_STS_RESOURCE from ->queue_rq() and 
>>>>> letting commands time out when the driver implementing ->queue_rq() 
>>>>> hits a resource constraint.
>>>>>
>>>>> If there is a resource constraint, the driver is free to return 
>>>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or to 
>>>>> accept the request (in which case there will be a timeout).
>>>>
>>>> There is no resource constraint. The driver sizes up the resources
>>>> to be able to queue all the requests it is getting.
>>>>
>>>>> I could live with a timeout if it just resulted in the command 
>>>>> being retried. But in the case of NVMe it results in a connection 
>>>>> reset to boot, making customers really nervous that their system 
>>>>> is broken.
>>>>
>>>> But how does the driver know that it is running in an environment 
>>>> that is completely congested? What I'm saying is that this is a 
>>>> specific use case, and the solution can have negative side-effects 
>>>> for other, more common use-cases, because handling it is beyond the 
>>>> scope of the driver.
>>>>
>>>> We can also trigger this condition with nvme-rdma.
>>>>
>>>> We could stay with this patch, but I'd argue that this might be the
>>>> wrong thing to do in certain use-cases.
>>>>
>>> Right, okay.
>>>
>>> Arguably this is a workload corner case, and we might not want to fix 
>>> this in the driver.
>>>
>>> _However_: do we need to do a controller reset in this case?
>>> Shouldn't it be sufficient to just complete the command with a 
>>> timeout error and be done with it?
>>
>> The question is what is special about this timeout vs. any other
>> timeout?
>>
>> PCI attempts to abort the command before triggering a controller
>> reset. Maybe we should do that as well? Although abort is not really
>> reliable, as it goes on the admin queue...
> 
> I am not talking about NVMe abort.
> I'm talking about this:
> 
> @@ -2335,6 +2340,11 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
>                  "queue %d: timeout request %#x type %d\n",
>                  nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
> 
> +       if (!list_empty(&req->entry)) {
> +               nvme_tcp_complete_timed_out(rq);
> +               return BLK_EH_DONE;
> +       }
> +
>          if (ctrl->state != NVME_CTRL_LIVE) {
>                  /*
>                   * If we are resetting, connecting or deleting we should
> 
> 
> as the command is still in the queue, so NVMe abort doesn't enter the 
> picture at all.

That for sure will not help, because nvme_tcp_complete_timed_out stops
the queue. What you could do instead is just remove the request from the
pending list and complete it, something along the lines of the sketch
below.
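
Something like this (untested sketch, just to illustrate; it makes the
same assumption as your hunk, namely that a non-empty req->entry means
the request is still sitting on the queue's send_list and never went
out on the wire, and that send_mutex serializes access to that list;
the helper name is made up):

static bool nvme_tcp_timeout_dequeue(struct nvme_tcp_request *req)
{
        struct nvme_tcp_queue *queue = req->queue;
        struct request *rq = blk_mq_rq_from_pdu(req);
        bool dequeued = false;

        /* only take it back if it never left our internal queuing */
        mutex_lock(&queue->send_mutex);
        if (!list_empty(&req->entry)) {
                list_del_init(&req->entry);
                dequeued = true;
        }
        mutex_unlock(&queue->send_mutex);

        if (!dequeued)
                return false;

        /* nothing ever hit the wire, so no need to stop the queue */
        nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
        nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
        blk_mq_complete_request(rq);
        return true;
}

and then in nvme_tcp_timeout(), before the ctrl->state check:

        if (nvme_tcp_timeout_dequeue(req))
                return BLK_EH_DONE;

If the request was already picked up for sending we fall through to the
existing handling, which still looks like the right thing for a command
that may be partially on the wire.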


