[PATCH 1/3] nvme-tcp: spurious I/O timeout under high load

Sagi Grimberg sagi at grimberg.me
Tue May 24 00:57:53 PDT 2022


>>>>> I'm open to discussion what we should be doing when the request is 
>>>>> in the process of being sent. But when it didn't have a chance to 
>>>>> be sent and we just overloaded our internal queuing we shouldn't be 
>>>>> sending timeouts.
>>>>
>>>> As mentioned above, what happens if that same reporter opens another
>>>> bug reporting that the same phenomenon happens with soft-iwarp? What
>>>> would you tell him/her?
>>>
>>> Nope. It's a HW appliance. Not a chance to change that.
>>
>> It was just a theoretical question.
>>
>> Do note that I'm not against solving a problem for anyone; I'm just
>> questioning whether making the io_timeout effectively unbounded when
>> the network is congested is the right solution for everyone, rather
>> than a particular case that can easily be solved with a udev rule
>> that sets the io_timeout as high as needed.
>>
>> One can argue that this patchset is making nvme-tcp basically
>> ignore the device io_timeout in certain cases.
> 
> Oh, yes, sure, that will happen.
> What I'm actually arguing about is the imprecise distinction between
> BLK_STS_AGAIN / BLK_STS_RESOURCE as return values from ->queue_rq()
> and command timeouts when the driver implementing ->queue_rq() is
> resource constrained.
> 
> If there is a resource constraint, the driver is free to return
> BLK_STS_RESOURCE (in which case you wouldn't see a timeout) or accept
> the request (in which case there will be a timeout).

There is no resource constraint. The driver sizes up the resources
to be able to queue all the requests it is getting.
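
To make that distinction concrete, here is a minimal sketch of the two
->queue_rq() paths being discussed; the "foo" driver, its budget check
and its queue helper are hypothetical, this is not the actual nvme-tcp
code:

  #include <linux/blk-mq.h>

  /* hypothetical per-queue context and helpers */
  struct foo_queue;
  bool foo_queue_has_budget(struct foo_queue *q);
  void foo_queue_request(struct foo_queue *q, struct request *rq);

  static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
  {
          struct foo_queue *queue = hctx->driver_data;

          /* Path 1: push back.  The block layer requeues the request
           * later and the command timeout never starts ticking. */
          if (!foo_queue_has_budget(queue))
                  return BLK_STS_RESOURCE;

          /* Path 2: accept.  Once blk_mq_start_request() runs, the
           * io_timeout clock is armed, even if the request only sits
           * in an internal send queue waiting for the wire. */
          blk_mq_start_request(bd->rq);
          foo_queue_request(queue, bd->rq);
          return BLK_STS_OK;
  }

The point of contention is exactly the second path: the request is
accepted and the timeout is armed, but under congestion the data may sit
in the driver's internal send queue well past io_timeout.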

> I could live with a timeout if that would just result in the command 
> being retried. But in the case of nvme it results in a connection reset 
> to boot, making customers really nervous that their system is broken.

But how does the driver know that it is running in such a completely
congested environment? What I'm saying is that this is a specific use
case, and the solution can have negative side effects for other, more
common use cases, because it is beyond the scope of the driver to handle.

We can also trigger this condition with nvme-rdma.

We could stay with this patch, but I'd argue that this might be the
wrong thing to do in certain use cases.

> And having a workload which can generate connection resets feels like a 
> DoS attack to me; applications shouldn't be able to do that.

Having a workload occupy all the tags all the time is also a DoS
attack.


