[PATCH v4 10/15] nvme-tcp: Use CCR to recover controller that hits an error
Randy Jennings
randyj at purestorage.com
Fri May 15 15:51:18 PDT 2026
On Fri, Mar 27, 2026 at 5:46 PM Mohamed Khalfella
<mkhalfella at purestorage.com> wrote:
>
> An alive nvme controller that hits an error now will move to FENCING
> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> terminate inflight IOs. Regardless of the success or failure of CCR
> operation the controller is transitioned to RESETTING state to continue
> error recovery process.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella at purestorage.com>
> @@ -612,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> + dev_warn(ctrl->device, "starting controller fencing\n");
> + queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> + return;
> + }
> +
> if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> return;
dev_warn(ctrl->device, "starting error recovery\n");
queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
}
> @@ -2471,12 +2478,29 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> +static void nvme_tcp_fencing_work(struct work_struct *work)
> +{
> + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> +}
There is an inconsistency in logging there for starting error recovery.
I'd rather be able to grep logs and find "starting error recovery" even
if we entered the FENCING state (and there is the race condition related
to Hannes's question were the log could get printed if we had a timeout
at just the right time.
As such, I think we should
dev_warn(ctrl->device, "starting error recovery\n");
before queuing the work.
Sincerely,
Randy Jennings
More information about the Linux-nvme
mailing list