[PATCH 1/3] nvme-tcp: spurious I/O timeout under high load

Sagi Grimberg sagi at grimberg.me
Fri May 20 02:05:14 PDT 2022


The patch title names a symptom, but does not explain what the patch
does or what it fixes.

> When running on slow links requests may take some time to be
> processed, and since we always allow requests to be queued, the
> timeout may trigger while the requests are still queued. E.g.
> sending 128M requests over 30 queues over a 1GigE link will
> inevitably time out before the last request can be sent. So reset
> the timeout if the request is still queued or is in the process
> of being sent.
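
For scale: if I read "128M" as 128 MB per request with one queued on
each of the 30 queues, that is ~3.8 GB of queued data behind a link
that moves at best ~117 MiB/s, i.e. roughly 33 seconds of backlog
against the 30 second default I/O timeout, so the tail of the queue
cannot possibly make it out in time. (My numbers, not ones stated in
the patch.)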

Maybe I'm missing something... But you are overloading so much that
you time out even before a command is sent out. That still does not
change the fact that the timeout expired. Why is resetting the timer,
without taking any corrective action, acceptable in this case?

Is this solving a bug? The fact that you get timeouts in your test
is somewhat expected, isn't it?

> 
> Signed-off-by: Hannes Reinecke <hare at suse.de>
> ---
>   drivers/nvme/host/tcp.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index bb67538d241b..ede76a0719a0 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2332,6 +2332,13 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
>   		"queue %d: timeout request %#x type %d\n",
>   		nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
>   
> +	if (!list_empty(&req->entry) || req->queue->request == req) {
> +		dev_warn(ctrl->device,
> +			 "queue %d: queue stall, resetting timeout\n",
> +			 nvme_tcp_queue_id(req->queue));
> +		return BLK_EH_RESET_TIMER;
> +	}
> +
>   	if (ctrl->state != NVME_CTRL_LIVE) {
>   		/*
>   		 * If we are resetting, connecting or deleting we should
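
To restate what the added check effectively does, here is a minimal
standalone model (made-up names and types, not the actual driver
code); anything that has not fully hit the wire just gets its timer
re-armed:

#include <stdbool.h>
#include <stdio.h>

enum eh_result {
	EH_RESET_TIMER,		/* re-arm the timer, take no other action */
	EH_HANDLE,		/* proceed with normal timeout handling */
};

struct model_req {
	bool on_send_list;	/* still linked on the queue's send list */
	bool being_sent;	/* currently being written to the socket */
};

/* Models the added check: not-yet-sent requests only re-arm. */
static enum eh_result model_timeout(const struct model_req *req)
{
	if (req->on_send_list || req->being_sent)
		return EH_RESET_TIMER;
	return EH_HANDLE;
}

int main(void)
{
	struct model_req queued  = { .on_send_list = true };
	struct model_req on_wire = { false, false };

	printf("queued:  %s\n", model_timeout(&queued) == EH_RESET_TIMER ?
	       "re-arm" : "handle timeout");
	printf("on wire: %s\n", model_timeout(&on_wire) == EH_RESET_TIMER ?
	       "re-arm" : "handle timeout");
	return 0;
}

As long as a request stays on the send list it keeps re-arming, so
its effective timeout is unbounded - which is what the question above
is getting at.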
