[PATCH v2 08/14] nvme: Implement cross-controller reset recovery

Tue Feb 10 16:12:43 PST 2026

On Tue 2026-02-10 15:25:55 -0800, Mohamed Khalfella wrote:
> On Tue 2026-02-10 14:49:15 -0800, James Smart wrote:
> > On 2/10/2026 2:27 PM, Mohamed Khalfella wrote:
> > > On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
> > >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > >> ...
> > >>> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > >>> +{
> > >>> +	unsigned long deadline, now, timeout;
> > >>> +	struct nvme_ctrl *sctrl;
> > >>> +	u32 min_cntlid = 0;
> > >>> +	int ret;
> > >>> +
> > >>> +	timeout = nvme_fence_timeout_ms(ictrl);
> > >>> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > >>> +
> > >>> +	now = jiffies;
> > >>> +	deadline = now + msecs_to_jiffies(timeout);
> > >>> +	while (time_before(now, deadline)) {
> > >>
> > >> Q: don't we have something to identify the controller's subsystem
> > >> supports CCR before we starting selecting controllers and sending CCR ?
> > >>
> > >> I would think on older devices that don't support it we should be
> > >> skipping this loop.   The loop could delay the Time-Based delay without
> > >> any CCR.
> > > 
> > > I do not think we have something that identifies CCR support at
> > > subsystem level. The spec defines CCRL at the controller level. The loop
> > > should not that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
> > > not supported and nvme_fence_ctrl() will return immediately.
> > > 
> > >>
> > >> -- james
> > >>
> > 
> > I would think CCRL on the failed controller would be enough to assume 
> > the subsystem supports it.
> 
> ictrl->ccr_limit is a good indication that subsystem supports CCR. I do
> not think it is enough though. I say that for two reasons:
> 
> - May be this controller does not support CCR but others do on the same
>   subsystem. There is nothing prevents subsystem from putting a cap of
>   CCR at subsytem level.
> - May be this controller supports CCR command but not now because all
>   CCR slots are used now. This can happen in the case of cascading
>   failure.
> 
> > 
> > I'm not worried about the coding on the host is so bad. It's more the 
> > multiple paths that must have cmds sent to them and getting error 
> > responses for unknown cmds (should be responded to ok, but you never 
> > know) as well as creating conditions for other errors where there will 
> > be no return for it - e.g. other paths losing connectivity while the ccr 
> > outstanding, etc. yes, they all have to work, but why bother adding 
> > these flows to an old controller that would never do CCR ?
> 
> If nvme_find_ctrl_ccr() returns a source controller to use then we know
> the controller supports CCR and does have an available slot to process
> this CCR request. I do not see how this code will send CCR request to an
> old controller that does not know about CCR command.
> 
> I am not fully opposed against using ictrl->ccr_limit to return early. I
> do not see the need for it. If you feel strongly about it I can update
> nvme_fence_ctrl() to do so.
> 

I forgot to mention that ctrl->ccr_limit is initialized from id->ccrl in 
nvme_init_identify(). If this value is greater than zero then we know
the controller does support CCR. nvme_find_ctrl_ccr() checks for that
and the returned source controller must support CCR and has a slot
available for it.