nvme-tcp request timeouts

Seth Forshee sforshee at kernel.org
Tue Oct 11 13:14:25 PDT 2022


On Tue, Oct 11, 2022 at 07:30:56PM +0000, Chaitanya Kulkarni wrote:
> Hi Seth,
> 
> On 10/11/22 08:31, Seth Forshee wrote:
> > Hi,
> > 
> > I'm seeing timeouts like the following from nvme-tcp:
> > 
> > [ 6369.513269] nvme nvme5: queue 102: timeout request 0x73 type 4
> > [ 6369.513283] nvme nvme5: starting error recovery
> > [ 6369.514379] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514385] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514392] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514393] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514401] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514414] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514420] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514427] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514430] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514432] block nvme5n1: no usable path - requeuing I/O
> > [ 6369.514926] nvme nvme5: Reconnecting in 10 seconds...
> > [ 6379.761015] nvme nvme5: creating 128 I/O queues.
> > [ 6379.944389] nvme nvme5: mapped 128/0/0 default/read/poll queues.
> > [ 6379.947922] nvme nvme5: Successfully reconnected (1 attempt)
> > 
> > This is with 6.0, using nvmet-tcp on a different machine as the target.
> > I've seen this sporadically with several test cases. The fio fio-rand-RW
> > example test is a pretty good reproducer when numjobs in increased (I'm
> > setting it equal to the number of CPUs in the system).
> > 
> > Let me know what I can do to help debug this. I'm currently adding some
> > tracing to the driver to see if I can get an idea of the sequence of
> > events that leads to this problem.
> > 
> > Thanks,
> > Seth
> > 
> 
> Can you bisect it ? that will help to understand the commit causing
> issue.

I don't know of any "good" version right now. I started with a 5.10
kernel and saw this, and tested 6.0 and still see it. I found several
commits since 5.10 which fix some kind of timeouts:

a0fdd1418007 nvme-tcp: rerun io_work if req_list is not empty
70f437fb4395 nvme-tcp: fix io_work priority inversion
3770a42bb8ce nvme-tcp: fix regression that causes sporadic requests to time out

5.10 still has timeouts with these backported, so whatever the problem
is it has existed at least that long. I suppose I could go back to older
kernels with these backported if that's going to be the best path
forward here.

Thanks,
Seth



More information about the Linux-nvme mailing list