[PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING

Mon Jan 15 04:36:33 PST 2024

On 1/12/24 12:09, hare at kernel.org wrote:
> From: Hannes Reinecke <hare at suse.de>
> 
> Terminating commands from the timeout handler might lead
> to a data corruption as the timeout might trigger before
> KATO expired.
> When several commands have been sent in a batch and
> the command timeouts trigger just after the keep-alive
> command has been sent then the first command will trigger
> the error recovery. But all other commands will timeout
> directly afterwards and will hit the timeout handler
> before the err_work workqueue handler has started.
> This results in these commands being aborted and
> immediately retried without waiting for KATO.
> So return BLK_EH_RESET_TIMER for I/O commands when
> the controller is in 'RESETTING' or 'DELETING'
> state to ensure that the commands will
> be retried only after the KATO interval.
> 
> Signed-off-by: Hannes Reinecke <hare at suse.de>
> ---
>   drivers/nvme/host/tcp.c | 22 ++++++++++++++++------
>   1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 08805f027810..9dcb2d3b123c 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2431,17 +2431,27 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
>   	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
>   	u8 opc = pdu->cmd.common.opcode, fctype = pdu->cmd.fabrics.fctype;
>   	int qid = nvme_tcp_queue_id(req->queue);
> +	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
>   
>   	dev_warn(ctrl->device,
> -		"queue %d: timeout cid %#x type %d opcode %#x (%s)\n",
> +		"queue %d: timeout cid %#x type %d opcode %#x (%s) state %d\n",
>   		nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
> -		opc, nvme_opcode_str(qid, opc, fctype));
> +		opc, nvme_opcode_str(qid, opc, fctype), state);
> +
> +	/*
> +	 * If the controller is in state RESETTING or DELETING all
> +	 * inflight commands will be terminated soon which in turn
> +	 * may failover to a different path.
> +	 */
> +	if ((state == NVME_CTRL_RESETTING ||
> +	     state == NVME_CTRL_DELETING) && qid > 0)
> +		return BLK_EH_RESET_TIMER;
>   
> -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> +	if (state != NVME_CTRL_LIVE) {

The remaining states are:
NVME_CTRL_NEW
NVME_CTRL_CONNECTING
NVME_CTRL_DELETING_NOIO
NVME_CTRL_DEAD

The comment perhaps is becoming outdated?

I'm wandering if it would make sense to turn this
into a switch-case statement:
--
	switch (state) {
	case NVME_CTRL_LIVE:
		/*
		 * LIVE state should trigger the normal error recovery
		 * which will handle completing this request.
          	 */
		break;
	case NVME_CTRL_RESETTING:
		/* fallthru */
	case NVME_CTRL_DELETING:
		if (!qid)
			break;
		/*
		 * If the controller is in state RESETTING or DELETING
		 * all inflight I/O commands will be terminated soon
		 * which in turn may failover to a different path.
		 */
		return BLK_EH_RESET_TIMER;
	case NVME_CTRL_DELETING_NOIO;
	case NVME_CTRL_DEAD;
		/* fallthru */
	case NVME_CTRL_CONNECTING;
		/*
		 * Fail immediately to avoid a deadlock if the request
		 * is blocking as part of the ctrl setup or teardown
		 * sequence (e.g. connect, property_set).
		 */
		nvme_tcp_complete_timed_out(rq);
		return BLK_EH_DONE;
	case NVME_CTRL_NEW:
		/* fallthru */
	default:
		WARN_ONCE();
		return BLK_EH_RESET_TIMER;
	}

	nvme_tcp_error_recovery(ctrl);
	return BLK_EH_RESET_TIMER;
--

Thoughts?