[PATCH 1/3] nvme-tcp: spurious I/O timeout under high load

Sagi Grimberg sagi at grimberg.me
Mon May 23 06:36:57 PDT 2022


>> The patch title does not explain what the patch does, or what it
>> fixes.
>>
>>> When running on slow links, requests might take some time
>>> to be processed, and as we always allow requests to be queued,
>>> the timeout may trigger while the requests are still queued.
>>> E.g. sending 128M requests over 30 queues over a 1GigE link
>>> will inevitably time out before the last request can be sent.
>>> So reset the timeout if the request is still queued
>>> or if it is in the process of being sent.
>>
>> Maybe I'm missing something... But you are overloading so much that you
>> time out even before a command is sent out. That still does not change
>> the fact that the timeout expired. Why is resetting the timer, without
>> taking any other action, acceptable in this case?
>>
>> Is this solving a bug? The fact that you get timeouts in your test
>> is somewhat expected, isn't it?
>>
> 
> Yes, and no.
> We happily let requests sit in the (blk-layer) queue for basically any 
> amount of time.
> And it's a design decision within the driver _when_ to start the timer.

Is it? Isn't it supposed to start when the request is queued?
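
To make that concrete: blk-mq arms the request timer from
blk_mq_start_request(), which a driver calls in its ->queue_rq() handler
before it does any internal queueing. A minimal sketch of that ordering
(simplified, not the actual nvme-tcp code; example_driver_queue_request()
is a made-up stand-in for the driver's software queueing):

#include <linux/blk-mq.h>

/* Hypothetical helper standing in for the driver's internal queueing. */
static void example_driver_queue_request(struct request *rq);

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        /* The request timer is armed here, inside ->queue_rq() ... */
        blk_mq_start_request(rq);

        /*
         * ... before the request ever reaches the driver's internal
         * send queue, which is drained later by a separate context.
         */
        example_driver_queue_request(rq);

        return BLK_STS_OK;
}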

> My point is that starting the timer and _then_ doing internal queuing is 
> questionable; we might have returned BLK_STS_AGAIN (or something) when 
> we found that we cannot send requests right now.
> Or we might have started the timer only when the request is being sent 
> to the HW.

It is not sent to the HW, it is sent down the TCP stack. But that is no
different from posting the request to a hw queue on a pci/rdma/fc
device. The device has some context that processes the queue and sends
it to the wire; in nvme-tcp that context is io_work.
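
In other words, the pattern is roughly the following (names simplified
for illustration, this is not the literal nvme-tcp code): queuing a
request only appends it to a software list and kicks the per-queue work
item, and that work item is the "device context" that actually pushes
the data down the TCP stack.

#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/workqueue.h>

/* Hypothetical, simplified types for illustration only. */
struct example_request {
        struct llist_node lentry;
};

struct example_queue {
        struct llist_head req_list;
        struct work_struct io_work;
        int io_cpu;
};

static struct workqueue_struct *example_wq;

/* Hypothetical helper: push whatever is pending onto the socket. */
static bool example_try_send(struct example_queue *queue);

/* ->queue_rq() path: just enqueue and kick the worker. */
static void example_queue_request(struct example_queue *queue,
                                  struct example_request *req)
{
        llist_add(&req->lentry, &queue->req_list);
        queue_work_on(queue->io_cpu, example_wq, &queue->io_work);
}

/* The worker context: drains the software queue onto the wire. */
static void example_io_work(struct work_struct *w)
{
        struct example_queue *queue =
                container_of(w, struct example_queue, io_work);

        while (example_try_send(queue))
                ;
}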

> So returning a timeout in one case but not the other is somewhat erratic.

How is that different from posting a work request to an rdma nic on
a congested network? An imaginary 1Gb rdma nic :)

Or maybe let's ask it differently: what happens if you run this test
on the same nic, but with a soft-iwarp/soft-roce interface on top of it?

> I would argue that we should only start the timer when requests have had 
> a chance to be sent to the HW; when it's still within the driver, one has 
> a hard time arguing why timeouts do apply on one level but not on the 
> other, especially as both levels do exactly the same (to wit: queue 
> commands until they can be sent).

I look at this differently. The way I see it, nvme-tcp is exactly like
nvme-rdma/nvme-fc, but it also implements, in software, the context that
executes the command. So in my mind, this is mixing different layers.

> I'm open to discussing what we should be doing when the request is in 
> the process of being sent. But when it didn't have a chance to be sent 
> and we just overloaded our internal queuing, we shouldn't be triggering 
> timeouts.

As mentioned above, what happens if that same reporter opens another bug
reporting that the same phenomenon happens with soft-iwarp? What would
you tell him/her?
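
For reference, the behavior being debated amounts to roughly the
following in the driver's ->timeout() handler (a sketch only, built on a
hypothetical per-request state field; this is not the actual patch):

#include <linux/blk-mq.h>

/* Hypothetical driver-private request state. */
enum example_req_state {
        EXAMPLE_REQ_QUEUED,     /* still in the driver's software queue */
        EXAMPLE_REQ_SENDING,    /* io_work is in the middle of sending it */
        EXAMPLE_REQ_SENT,       /* fully handed to the TCP stack */
};

struct example_req_pdu {
        enum example_req_state state;
};

static enum blk_eh_timer_return example_timeout(struct request *rq)
{
        struct example_req_pdu *pdu = blk_mq_rq_to_pdu(rq);

        /*
         * The disputed behavior: if the command never made it onto the
         * wire, re-arm the timer instead of escalating to error recovery.
         */
        if (pdu->state == EXAMPLE_REQ_QUEUED ||
            pdu->state == EXAMPLE_REQ_SENDING)
                return BLK_EH_RESET_TIMER;

        return BLK_EH_DONE;
}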


