[PATCH 1/3] nvme-tcp: avoid spurious I/O timeouts under high load

Hannes Reinecke hare at suse.de
Wed May 18 23:26:15 PDT 2022


When running over slow links requests might take some time
to be processed, and since we always allow new requests to be
queued the I/O timeout may trigger while requests are still
queued for sending. E.g. sending 128M requests across 30 queues
over a 1GigE link will inevitably hit the timeout before the
last request can be sent.
So reset the timer if the request is still queued or is in the
process of being sent.
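
For reference, a minimal sketch of the decision this patch adds (it
reuses the driver's existing req->entry send-list linkage and the
queue->request "currently sending" pointer, as in the hunk below; the
helper name is made up for illustration and is not part of the patch):

	/*
	 * Sketch only: a request still linked on its queue's send list,
	 * or the one the queue is currently transmitting, has not been
	 * fully sent to the controller yet, so re-arm the block layer
	 * timer instead of escalating to error recovery.
	 */
	static enum blk_eh_timer_return
	nvme_tcp_sketch_check_pending(struct nvme_tcp_request *req)
	{
		if (!list_empty(&req->entry) ||	/* still queued for sending */
		    req->queue->request == req)	/* currently being sent */
			return BLK_EH_RESET_TIMER;

		/* fully sent: fall through to the normal timeout handling */
		return BLK_EH_DONE;
	}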

Signed-off-by: Hannes Reinecke <hare at suse.de>
---
 drivers/nvme/host/tcp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bb67538d241b..ede76a0719a0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
 		"queue %d: timeout request %#x type %d\n",
 		nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
 
+	if (!list_empty(&req->entry) || req->queue->request == req) {
+		dev_warn(ctrl->device,
+			 "queue %d: queue stall, resetting timeout\n",
+			 nvme_tcp_queue_id(req->queue));
+		return BLK_EH_RESET_TIMER;
+	}
+
 	if (ctrl->state != NVME_CTRL_LIVE) {
 		/*
 		 * If we are resetting, connecting or deleting we should
-- 
2.29.2



