[RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error

Mohamed Khalfella mkhalfella at purestorage.com
Wed Dec 31 16:04:09 PST 2025


On Thu 2025-12-18 18:06:02 -0800, Randy Jennings wrote:
> On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
> <mkhalfella at purestorage.com> wrote:
> >
> > An alive nvme controller that hits an error now will move to RECOVERING
> > state instead of RESETTING state. In RECOVERING state ctrl->err_work
> > will attempt to use cross-controller recovery to terminate inflight IOs
> > on the controller. If CCR succeeds, then switch to RESETTING state and
> > continue error recovery as usuall by tearing down controller and attempt
> > reconnecting to target. If CCR fails, then the behavior of recovery
> "usuall" -> "usual"
> "attempt reconnecting" -> "attempting to reconnect"
> 
> it would read better with "the" added:
> "tearing down the controller"
> "reconnect to the target"

Updated as suggested.

> 
> > depends on whether CQT is supported or not. If CQT is supported, switch
> > to time-based recovery by holding inflight IOs until it is safe for them
> > to be retried. If CQT is not supported proceed to retry requests
> > immediately, as the code currently does.
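
To make the intended flow concrete, below is a rough sketch of the decision
logic. This is simplified for illustration only: nvme_ctrl_ccr_fence(),
nvme_ccr_quiet_timeout() and ctrl->cqt are placeholder names, not necessarily
the helpers or fields used in the series, and error handling is omitted.

/*
 * Simplified sketch, not the exact patch code.  nvme_ctrl_ccr_fence()
 * and nvme_ccr_quiet_timeout() stand in for the CCR fencing attempt
 * and the remaining CQT-based wait; ctrl->cqt stands in for wherever
 * the controller's CQT value is stored.
 */
static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
{
        unsigned long rem;

        if (nvme_ctrl_ccr_fence(ctrl) == 0) {
                /* CCR fenced inflight IOs; safe to tear down and reconnect. */
                nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING);
                return 0;
        }

        if (!ctrl->cqt) {
                /* No CQT reported: keep the current immediate-retry behavior. */
                nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING);
                return 0;
        }

        /* CCR failed but CQT is known: hold IOs until it is safe to retry. */
        rem = nvme_ccr_quiet_timeout(ctrl);
        dev_info(ctrl->device,
                 "CCR failed, switch to time-based recovery, timeout = %ums\n",
                 jiffies_to_msecs(rem));
        set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
        queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
        return -EAGAIN;
}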
> 
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> 
> > +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> 
> > +       dev_info(ctrl->device,
> > +                "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > +                jiffies_to_msecs(rem));
> > +       set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> > +       queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> > +       return -EAGAIN;
> I see how setting this bit before the delayed work executes works
> to complete recovery, but it is kind of weird that the bit is called
> RECOVERED.  I do not have a better name.  TIME_BASED_RECOVERY?
> RECOVERY_WAIT?

Agreed, it does look weird. If we agree to add the two states FENCING and
FENCED, then the flag might not be needed.
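
For illustration only, if those two states were added, err_work could key
off the controller state instead of the flag. NVME_CTRL_FENCING and
NVME_CTRL_FENCED below are hypothetical state names, not part of the
current series.

/*
 * Hypothetical sketch assuming NVME_CTRL_FENCING/NVME_CTRL_FENCED
 * states exist.  The NVME_CTRL_RECOVERED flag would no longer be
 * needed because the delayed work can tell the two phases apart
 * from the controller state alone.
 */
static void nvme_tcp_error_recovery_work(struct work_struct *work)
{
        struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
                                struct nvme_tcp_ctrl, err_work);
        struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;

        if (nvme_ctrl_state(ctrl) == NVME_CTRL_FENCING) {
                /* CCR attempt or CQT wait still in progress. */
                if (nvme_tcp_recover_ctrl(ctrl))
                        return;
        }

        /* NVME_CTRL_FENCED: fencing is done, continue with the existing
         * teardown and reconnect path below. */
}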

> 
> >  static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >  {
> > -       struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> > +       struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> >                                 struct nvme_tcp_ctrl, err_work);
> >         struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >
> > +       if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> > +               if (nvme_tcp_recover_ctrl(ctrl))
> > +                       return;
> > +       }
> > +
> >         if (nvme_tcp_key_revoke_needed(ctrl))
> >                 nvme_auth_revoke_tls_key(ctrl);
> >         nvme_stop_keep_alive(ctrl);
> The state of the controller should not be LIVE while waiting for
> recovery, so I do not think we will succeed in sending keep alives,
> but I think this should move to before (or inside of)
> nvme_tcp_recover_ctrl().

This is correct, no keepalive traffic will be sent in the RECOVERING state.
If we split the fencing work from the existing error recovery work then this
should be removed. I think we are going in that direction.
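
A rough sketch of that split, with fence_work as a hypothetical separate
delayed work that owns CCR and the CQT wait (names are placeholders, not
code from the series):

/*
 * Hypothetical: fence_work handles CCR and the time-based wait, and
 * only kicks err_work once inflight IOs are known to be fenced.  With
 * this split, nvme_stop_keep_alive() and friends stay in the teardown
 * path, and err_work could remain a plain work_struct since the delay
 * lives entirely in fence_work.
 */
static void nvme_tcp_fence_work(struct work_struct *work)
{
        struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
                                struct nvme_tcp_ctrl, fence_work);
        struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;

        /* Non-zero return means recovery is still pending (the CQT wait
         * re-queued fence_work), so do not start teardown yet. */
        if (nvme_tcp_recover_ctrl(ctrl))
                return;

        /* Fencing finished; run the usual teardown/reconnect recovery. */
        queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
}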

> 
> Sincerely,
> Randy Jennings


