nvme-tcp request timeouts

Sagi Grimberg sagi at grimberg.me
Wed Oct 12 10:30:18 PDT 2022



On 10/12/22 19:55, Seth Forshee wrote:
> On Wed, Oct 12, 2022 at 09:33:19AM +0300, Sagi Grimberg wrote:
>> Hey Seth, thanks for reporting.
>>
>>>>>> Hi Seth,
>>>>>>
>>>>>> On 10/11/22 08:31, Seth Forshee wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm seeing timeouts like the following from nvme-tcp:
>>>>>>>
>>>>>>> [ 6369.513269] nvme nvme5: queue 102: timeout request 0x73 type 4
>>>>>>> [ 6369.513283] nvme nvme5: starting error recovery
>>>>>>> [ 6369.514379] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514385] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514392] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514393] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514401] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514414] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514420] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514427] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514430] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514432] block nvme5n1: no usable path - requeuing I/O
>>>>>>> [ 6369.514926] nvme nvme5: Reconnecting in 10 seconds...
>>>>>>> [ 6379.761015] nvme nvme5: creating 128 I/O queues.
>>>>>>> [ 6379.944389] nvme nvme5: mapped 128/0/0 default/read/poll queues.
>>>>>>> [ 6379.947922] nvme nvme5: Successfully reconnected (1 attempt)
>>>>>>>
>>>>>>> This is with 6.0, using nvmet-tcp on a different machine as the target.
>>>>>>> I've seen this sporadically with several test cases. The fio fio-rand-RW
>>>>>>> example test is a pretty good reproducer when numjobs is increased (I'm
>>>>>>> setting it equal to the number of CPUs in the system).
>>>>>>>
>>>>>>> Let me know what I can do to help debug this. I'm currently adding some
>>>>>>> tracing to the driver to see if I can get an idea of the sequence of
>>>>>>> events that leads to this problem.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Seth
>>>>>>>
>>>>>>
>>>>>> Can you bisect it? That will help identify the commit causing the
>>>>>> issue.
>>>>>
>>>>> I don't know of any "good" version right now. I started with a 5.10
>>>>> kernel and saw this, and tested 6.0 and still see it. I found several
>>>>> commits since 5.10 which fix some kind of timeouts:
>>>>>
>>>>> a0fdd1418007 nvme-tcp: rerun io_work if req_list is not empty
>>>>> 70f437fb4395 nvme-tcp: fix io_work priority inversion
>>>>> 3770a42bb8ce nvme-tcp: fix regression that causes sporadic requests to time out
>>>>>
>>>>> 5.10 still has timeouts with these backported, so whatever the problem
>>>>> is it has existed at least that long. I suppose I could go back to older
>>>>> kernels with these backported if that's going to be the best path
>>>>> forward here.
>>>>>
>>>>> Thanks,
>>>>> Seth
>>>>
>>>> Can you please share the fio config you are using ?
>>>
>>> Sure. Note that I can reproduce it with a lower number of numjobs, but
>>> higher numbers make it easier, so I set it to the number of CPUs present
>>> on the system I'm using to test.
>>>
>>>
>>> [global]
>>> name=fio-rand-RW
>>> filename=fio-rand-RW
>>> rw=randrw
>>> rwmixread=60
>>> rwmixwrite=40
>>> bs=4K
>>> direct=0
>>
>> Does this happen with direct=1?
> 
> So far I haven't been able to reproduce it with direct=1.

It is probably because with direct=1 the I/Os actually go down as 4k
requests and the blk-mq tag depth is throttling you. With direct=0,
writes are batched by the page cache and go down to the disk in larger
blocks, and in bursts, which is probably creating longer latencies...
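
If you want to see that, watching the average write request size while
the job runs should show it. A rough way to do it on the host (assuming
sysstat's iostat; the column is wareq-sz on recent versions, avgrq-sz
on older ones):

    # while the fio job is running
    iostat -x 1

With direct=0 the nvme devices should show write sizes well above 4k.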

> 
>>> numjobs=128
>>> time_based
>>> runtime=900
>>>
>>> [file1]
>>> size=10G
>>> ioengine=libaio
>>> iodepth=16
>>>
>>
>> Is it possible that the backend nvmet device is not fast enough?
>> Are your backend devices nvme? Can you share nvmetcli ls output?
> 
> Currently I have a quick-and-dirty setup. The systems I have to test
> with don't have a spare nvme device right now but do have a lot of RAM,
> so the backend is a loop device backed by a file in tmpfs. But it should
> be plenty fast. I'm working on getting a setup with an nvme backend
> device.

I see. Note that RAM-backed drives are fast, but they stall the CPU
doing memory copies, which will prevent the nvme I/O threads from
servicing new requests. This can be magnified by large, heavy bursts...
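
If you want to confirm that on the target, a crude way is to profile it
while the workload runs (just a suggestion, assuming perf is available
there):

    # on the target, while the fio job is running on the host
    perf top

and look for memcpy-style symbols dominating the profile.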

> 
> o- / ......................................................................................................................... [...]
>    o- hosts ................................................................................................................... [...]
>    | o- hostnqn ............................................................................................................... [...]
>    o- ports ................................................................................................................... [...]
>    | o- 2 ................................................... [trtype=tcp, traddr=..., trsvcid=4420, inline_data_size=16384]
>    |   o- ana_groups .......................................................................................................... [...]
>    |   | o- 1 ..................................................................................................... [state=optimized]
>    |   o- referrals ........................................................................................................... [...]
>    |   o- subsystems .......................................................................................................... [...]
>    |     o- testnqn ........................................................................................................... [...]
>    o- subsystems .............................................................................................................. [...]
>      o- testnqn ............................................................. [version=1.3, allow_any=1, serial=2c2e39e2a551f7febf33]
>        o- allowed_hosts ....................................................................................................... [...]
>        o- namespaces .......................................................................................................... [...]
>          o- 1  [path=/dev/loop0, uuid=8a1561fb-82c3-4e9d-96b9-11c7b590d047, nguid=ef90689c-6c46-d44c-89c1-4067801309a8, grpid=1, enabled]

Ohh, I'd say that would be the culprit...
the loop driver uses only a single queue to access the disk. This means
that all of your 100+ nvme-tcp queues end up serializing on the single
loop disk queue. This heavy back-pressure bubbles all the way back to
the host and manifests in I/O timeouts when large bursts hit...
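
You can see the single queue in sysfs on the target (loop0 taken from
your nvmetcli output above; this is just the usual blk-mq sysfs layout):

    # hardware queues backing the loop device - typically just "0"
    ls /sys/block/loop0/mq/
    # and the request depth of its queue
    cat /sys/block/loop0/queue/nr_requests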

I can say that loop is not the best way to benchmark performance, and
I'd expect to see such phenomena when attempting to drive high loads
to a loop device...

Maybe you can use a tmpfs file directly instead (nvmet supports file
backends as well).
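
Something along these lines on the target should do it. This is only a
sketch - the testnqn subsystem and namespace 1 are taken from your
nvmetcli output, the tmpfs file name is made up, and the namespace has
to be disabled before device_path can be changed:

    # create a file on tmpfs and point the existing namespace at it
    truncate -s 10G /dev/shm/nvmet-backend.img
    echo 0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
    echo /dev/shm/nvmet-backend.img > \
        /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
    echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable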

Or maybe you can try null_blk with the memory_backed=Y modparam (you
may need to set the cache_size modparam as well; I've never tried it
with memory backing...)? That would be more efficient.
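
For reference, something like this is what I had in mind (the values
are just guesses, check modinfo null_blk on your kernel):

    modprobe null_blk nr_devices=1 gb=16 bs=4096 memory_backed=1 cache_size=1024
    # then point the nvmet namespace device_path at /dev/nullb0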


