[PATCH 1/3] nvme-tcp: spurious I/O timeout under high load

Sagi Grimberg sagi at grimberg.me
Tue May 24 01:53:05 PDT 2022


>>>>>>> I'm open to discussing what we should do when the request is in
>>>>>>> the process of being sent. But when it didn't have a chance to
>>>>>>> be sent and we just overloaded our internal queuing, we
>>>>>>> shouldn't be sending timeouts.
>>>>>>
>>>>>> As mentioned above, what happens if that same reporter opens
>>>>>> another bug reporting that the same phenomenon happens with
>>>>>> soft-iwarp? What would you tell him/her?
>>>>>
>>>>> Nope. It's a HW appliance. Not a chance to change that.
>>>>
>>>> It was just a theoretical question.
>>>>
>>>> Do note that I'm not against solving a problem for anyone; I'm just
>>>> questioning whether making the io_timeout effectively unbounded
>>>> whenever the network is congested is the right solution for
>>>> everyone, rather than for a particular case that can easily be
>>>> solved with a udev rule setting the io_timeout as high as needed.
>>>>
>>>> One can argue that this patchset makes nvme-tcp basically ignore
>>>> the device io_timeout in certain cases.
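
For reference, the udev rule I have in mind would be something like the
following; the 600 second value and the match pattern are examples only,
and queue/io_timeout is in milliseconds:

  # /etc/udev/rules.d/90-nvme-io-timeout.rules (hypothetical)
  ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
      KERNEL=="nvme*n*", ATTR{queue/io_timeout}="600000"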
>>>
>>> Oh, yes, sure, that will happen.
>>> What I'm actually arguing about is the imprecise distinction between
>>> returning BLK_STS_AGAIN / BLK_STS_RESOURCE from ->queue_rq() and
>>> letting commands time out when the driver implementing ->queue_rq()
>>> is under a resource constraint.
>>>
>>> If there is a resource constraint, the driver is free to return
>>> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or to
>>> accept the request (in which case there will be a timeout).
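
For completeness, the push-back path being described would look roughly
like this, completely untested, with queue_depth_exhausted() as a
made-up stand-in for whatever "internal queueing is overloaded" check
the driver would use:

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct nvme_tcp_queue *queue = hctx->driver_data;

	/*
	 * Push back before accepting the request: the block layer
	 * requeues it and no timeout timer runs, because the timer only
	 * starts once the driver calls blk_mq_start_request().
	 */
	if (queue_depth_exhausted(queue))
		return BLK_STS_RESOURCE;

	/* ... normal setup, start the request, kick io_work ... */
	return BLK_STS_OK;
}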
>>
>> There is no resource constraint. The driver sizes up the resources
>> to be able to queue all the requests it is getting.
>>
>>> I could live with a timeout if it just resulted in the command
>>> being retried. But in the case of nvme it results in a connection
>>> reset to boot, making customers really nervous that their system is
>>> broken.
>>
>> But how does the driver know that it is running in an environment
>> that is completely congested? What I'm saying is that this is a
>> specific use case, and the solution can have negative side effects
>> for other, more common use cases, because it is beyond the scope of
>> the driver to handle.
>>
>> We can also trigger this condition with nvme-rdma.
>>
>> We could stay with this patch, but I'd argue that this might be the
>> wrong thing to do in certain use-cases.
>>
> Right, okay.
> 
> Arguably this is a workload corner case, and we might not want to fix 
> this in the driver.
> 
> _However_: do we need to do a controller reset in this case?
> Shouldn't it be sufficient to just complete the command w/ timeout error 
> and be done with it?

The question is what is special about this timeout vs. any other
timeout?
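
To make that suggestion concrete, it would mean something like the
following in the timeout handler (untested sketch; nvme_tcp_request_sent()
is a made-up check for whether the request actually went on the wire):

static enum blk_eh_timer_return
nvme_tcp_timeout(struct request *rq, bool reserved)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

	if (!nvme_tcp_request_sent(req)) {
		/* never hit the wire: fail it locally, no reset needed */
		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
		blk_mq_complete_request(rq);
		return BLK_EH_DONE;
	}

	/* in flight on the wire: keep the existing error recovery path */
	nvme_tcp_error_recovery(&req->queue->ctrl->ctrl);
	return BLK_EH_RESET_TIMER;
}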

pci attempts to abort the command before triggering a controller
reset; maybe we should too? Although abort is not really reliable,
given that it goes on the admin queue...
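
For reference, the pci-style abort amounts to building an Abort command
and submitting it asynchronously on the admin queue, roughly like the
fragment below (modeled on nvme_timeout() in pci.c; the tcp plumbing and
error handling are elided):

	struct nvme_command cmd = { };

	/* rq/req refer to the timed-out request, as above */
	cmd.abort.opcode = nvme_admin_abort_cmd;
	cmd.abort.cid = nvme_cid(rq);
	cmd.abort.sqid = cpu_to_le16(nvme_tcp_queue_id(req->queue));

	/*
	 * Allocate a request on ctrl->admin_q and submit it without
	 * waiting; only complete the original command if the abort
	 * actually succeeds.
	 */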


