nvme-tcp request timeouts

Seth Forshee sforshee at kernel.org
Wed Oct 12 21:57:24 PDT 2022


On Wed, Oct 12, 2022 at 08:30:18PM +0300, Sagi Grimberg wrote:
> > o- / ......................................................................................................................... [...]
> >    o- hosts ................................................................................................................... [...]
> >    | o- hostnqn ............................................................................................................... [...]
> >    o- ports ................................................................................................................... [...]
> >    | o- 2 ................................................... [trtype=tcp, traddr=..., trsvcid=4420, inline_data_size=16384]
> >    |   o- ana_groups .......................................................................................................... [...]
> >    |   | o- 1 ..................................................................................................... [state=optimized]
> >    |   o- referrals ........................................................................................................... [...]
> >    |   o- subsystems .......................................................................................................... [...]
> >    |     o- testnqn ........................................................................................................... [...]
> >    o- subsystems .............................................................................................................. [...]
> >      o- testnqn ............................................................. [version=1.3, allow_any=1, serial=2c2e39e2a551f7febf33]
> >        o- allowed_hosts ....................................................................................................... [...]
> >        o- namespaces .......................................................................................................... [...]
> >          o- 1  [path=/dev/loop0, uuid=8a1561fb-82c3-4e9d-96b9-11c7b590d047, nguid=ef90689c-6c46-d44c-89c1-4067801309a8, grpid=1, enabled]
> 
> Ohh, I'd say that would be the culprit...
> the loop driver uses only a single queue to access the disk. This means that
> all your 100+ nvme-tcp queues are serializing access on the single loop
> disk queue. This heavy back-pressure bubbles all the way
> back to the host and manifests in IO timeouts when large bursts hit...
> 
> I can say that loop is not the best way to benchmark performance, and
> I'd expect to see such phenomena when attempting to drive high loads
> to a loop device...

The goal wasn't to benchmark performance with this setup, just to start
getting familiar with nvme-tcp.
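
The single-queue explanation makes sense, though. For anyone who runs
into this later, the mismatch is easy to see on a running setup (a rough
sketch; the host-side log line is from memory and the device names are
whatever happens to get enumerated):

    # target side: the loop device exposes a single blk-mq hardware queue
    ls /sys/block/loop0/mq/
    # -> 0

    # host side: nvme-tcp sets up roughly one I/O queue per CPU by default,
    # and the connect log shows the count, something like
    # "nvme nvme1: creating 128 I/O queues."
    dmesg | grep "I/O queues"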

> Maybe you can possibly use a tmpfs file directly instead (nvmet supports
> file backends as well).
> 
> Or maybe you can try to use null_blk with memory_backed=Y modparam (may need
> to define cache_size modparam as well, never tried it with memory
> backing...)? That would be more efficient.
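
Thanks, those both sound like better options. If I'm reading the
suggestions right, setting up either backend would be roughly the
following (untested on my end; the sizes and paths are just
placeholders):

    # memory-backed null_blk device to use as the namespace backend
    modprobe null_blk nr_devices=1 gb=4 memory_backed=1

    # or, for the file-backed option, a file on a tmpfs mount
    mount -t tmpfs -o size=4g tmpfs /mnt/nvmet
    truncate -s 2G /mnt/nvmet/ns1.img

and then point the namespace device_path at /dev/nullb0 or the tmpfs
file instead of the loop device.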

I've got this set up now with an nvme drive as the backend for the
target, and as you predicted, the timeouts went away. So the problem does
seem to have been the loop device backend. Thanks for the help!
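
For the record, switching backends was just a matter of repointing the
namespace device_path via configfs, roughly as below (the drive's device
name is a placeholder here, and the same steps should apply to a
null_blk or tmpfs file backend):

    echo 0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
    echo -n /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
    echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable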

Seth


