[PATCH 1/3] nvme-tcp: spurious I/O timeout under high load
Hannes Reinecke
hare at suse.de
Mon May 23 07:01:38 PDT 2022
On 5/23/22 15:36, Sagi Grimberg wrote:
>
>>> The patch title does not explain what the patch does, or what it
>>> fixes.
>>>
>>>> When running on slow links requests might take some time
>>>> to be processed, and as we always allow requests to be queued,
>>>> the timeout may trigger while the requests are still queued.
>>>> E.g. sending 128M requests over 30 queues on a 1GigE link
>>>> will inevitably time out before the last request can be sent.
>>>> So reset the timeout if the request is still queued
>>>> or is in the process of being sent.
>>>
>>> Maybe I'm missing something... But you are overloading so much that
>>> you time out even before a command is sent out. That still does not
>>> change the fact that the timeout expired. Why is resetting the timer
>>> without taking any action acceptable in this case?
>>>
>>> Is this solving a bug? The fact that you get timeouts in your test
>>> is somewhat expected isn't it?
>>>
>>
>> Yes, and no.
>> We happily let requests sit in the (blk-layer) queue for basically any
>> amount of time.
>> And it's a design decision within the driver _when_ to start the timer.
>
> Is it? Isn't it supposed to start when the request is queued?
>
Queued where?
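To make clear what I mean by the two levels of queueing: roughly (and
heavily abridged, so take it as a sketch rather than the actual code)
our ->queue_rq() path does

    static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
                    const struct blk_mq_queue_data *bd)
    {
            struct request *rq = bd->rq;
            struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

            /* ... PDU setup etc. elided ... */

            blk_mq_start_request(rq);    /* timeout timer starts here */

            /*
             * The request is only put on our internal send list here;
             * the actual sending happens later, from io_work,
             * whenever the socket lets us make progress.
             */
            nvme_tcp_queue_request(req, true, bd->last);

            return BLK_STS_OK;
    }

I.e. the timer is already running while the command sits on a purely
driver-internal list, waiting for io_work to pick it up.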
>> My point is that starting the timer and _then_ doing internal queuing
>> is questionable; we could return BLK_STS_AGAIN (or something) when we
>> find that we cannot send the request right now.
>> Or we could start the timer only once the request is actually being
>> sent to the HW.
>
> It is not sent to the HW, it is sent down the TCP stack. But that is no
> different from posting the request to a hw queue on a pci/rdma/fc
> device. The device has some context that processes the queue and sends
> it to the wire; in nvme-tcp that context is io_work.
>
>> So returning a timeout in one case but not the other is somewhat erratic.
>
> How is that different from posting a work request to an rdma nic on
> a congested network? An imaginary 1Gb rdma nic :)
>
> Or maybe let's ask it differently: what happens if you run this test
> on the same nic, but with a soft-iwarp/soft-roce interface on top of it?
>
I can't really tell, as I haven't tried.
Can give it a go, though.
>> I would argue that we should only start the timer once requests have
>> had a chance to be sent to the HW; while a request is still within the
>> driver one has a hard time arguing why timeouts apply on one level but
>> not on the other, especially as both levels do exactly the same thing
>> (to wit: queue commands until they can be sent).
>
> I look at this differently. The way I see it, nvme-tcp is exactly like
> nvme-rdma/nvme-fc, but it also implements the context executing the
> command, in software. So in my mind, this is mixing different layers.
>
Hmm. Yes, of course one could take this stance.
Especially given the NVMe-oF notion of 'transport'.
Sadly it's hard to reproduce this with other transports, as they
inevitably run only on HW fast enough not to exhibit this problem
directly (FC is now at 8G minimum, and IB probably at 10G or more).
The issue arises when running a fio test with a variable I/O size
(4M - 128M), which works fine on other transports like FC.
On TCP we run into said timeouts, but adding things like blk-cgroup
or rq-qos makes the issue go away.
So the natural question is why we need a traffic shaper on TCP, but
not on FC.
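To put some numbers on it: even if each of the 30 queues has just a
single 128M request in flight, that's already ~3.8GB queued up, and a
1GigE link moves at best ~117MB/s, so draining that backlog takes over
half a minute, which already exceeds the default 30 second I/O timeout.

And just to spell out what the patch is aiming at (a simplified sketch,
not the actual patch code; the state check is made up for illustration):
in the timeout handler we would simply rearm the timer as long as the
command never made it out of our internal queue, e.g.

    static enum blk_eh_timer_return
    nvme_tcp_timeout(struct request *rq, bool reserved)
    {
            struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

            /*
             * Illustration only: if the command is still sitting on
             * our internal send list (or has only partially hit the
             * socket), don't kick off error recovery, just give it
             * more time.
             */
            if (nvme_tcp_req_still_queued_or_sending(req)) /* made up */
                    return BLK_EH_RESET_TIMER;

            /* otherwise: the existing error recovery path */
            ...
    }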
>> I'm open to discussing what we should be doing when the request is in
>> the process of being sent. But when it didn't even have a chance to be
>> sent and we just overloaded our internal queuing, we shouldn't be
>> triggering timeouts.
>
> As mentioned above, what happens if that same reporter opens another bug
> saying that the same phenomenon happens with soft-iwarp? What would you
> tell him/her?
Nope. It's a HW appliance. Not a chance to change that.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer