[PATCH 0/3] nvme-tcp: queue stalls under high load
Hannes Reinecke
hare at suse.de
Fri May 20 03:01:48 PDT 2022
On 5/20/22 11:20, Sagi Grimberg wrote:
>
>> Hi all,
>>
>> one of our partners reported queue stalls and I/O timeouts under
>> high load. Analysis revealed that we see an extremely 'choppy' I/O
>> behaviour when running large transfers on systems with low-performance
>> links (eg 1GigE networks).
>> We had a system with 30 queues trying to transfer 128M requests; a simple
>> calculation shows that transferring a _single_ request on all queues
>> will take up to 38 seconds, thereby timing out the last request before
>> it is even sent.
>> As a solution, the first patch fixes up the timeout handler to reset the
>> timeout if the request is still queued or in the process of being sent.
>> The second patch modifies the send path to only allow new requests if we
>> have enough space on the TX queue, and the third breaks up the send loop
>> to avoid system stalls when sending large requests.
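To sketch the timeout-handler idea from the first patch (a minimal model, not the actual nvme-tcp code; the state names and the decision function are illustrative only):

```python
# Hypothetical sketch: a request that has never reached the wire should
# have its timer restarted rather than be failed, since the delay is
# queueing on a slow link, not a lost command.
from enum import Enum

class ReqState(Enum):
    QUEUED = 1    # still sitting on the internal send list
    SENDING = 2   # partially written to the TX queue
    SENT = 3      # fully handed to the network stack

def on_timeout(state: ReqState) -> str:
    """Return the action the timeout handler should take."""
    if state in (ReqState.QUEUED, ReqState.SENDING):
        return "reset_timer"   # request not lost, just not transmitted yet
    return "expire"            # genuinely outstanding on the wire: time out

print(on_timeout(ReqState.QUEUED))   # reset_timer
print(on_timeout(ReqState.SENT))     # expire
```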
>
> What is the average latency you are seeing with this test?
> I'm guessing more than 30 seconds :)
Yes, of course. Simple maths, in the end.
(Actually it's more, as we keep triggering reconnect cycles...)
And telling the customer to change his testcase only helps _so_ much.
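The "simple maths" above, as a sketch (the ~100 MB/s effective link rate is an assumption about 1GigE minus protocol overhead, and "128M" is read as a 128 MB transfer per queue):

```python
# Back-of-the-envelope: 30 queues each pushing one 128 MB request
# over a shared 1GigE link at an assumed ~100 MB/s effective rate.
queues = 30
request_mb = 128
link_mb_per_s = 100  # assumption: usable throughput on 1GigE

total_seconds = queues * request_mb / link_mb_per_s
print(total_seconds)  # 38.4 -- so a 30 s I/O timeout fires first
```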
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer