[PATCH 0/3] nvme-tcp: queue stalls under high load

Hannes Reinecke hare at suse.de
Fri May 20 03:01:48 PDT 2022


On 5/20/22 11:20, Sagi Grimberg wrote:
> 
>> Hi all,
>>
>> one of our partners reported queue stalls and I/O timeouts under
>> high load. Analysis revealed an extremely 'choppy' I/O behaviour
>> when running large transfers on systems with low-performance links
>> (eg 1GigE networks).
>> We had a system with 30 queues trying to transfer 128 MB requests; a
>> simple calculation shows that transferring a _single_ request on all
>> queues will take up to 38 seconds, thereby timing out the last request
>> before it even got sent.
>> As a solution I first fixed up the timeout handler to reset the timeout
>> if the request is still queued or in the process of being sent. The
>> second patch modifies the send path to only allow new requests if we
>> have enough space on the TX queue, and the third breaks up the send loop
>> to avoid system stalls when sending large requests.
> 
> What is the average latency you are seeing with this test?
> I'm guessing more than 30 seconds :)

Yes, of course. Simple maths, in the end.
(Actually it's more, as we're always triggering a reconnect cycle...)
And telling the customer to change his testcase only helps _so_ much.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare at suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



More information about the Linux-nvme mailing list