[PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT

Mon Feb 16 23:09:33 PST 2026

On 2/16/26 19:45, Mohamed Khalfella wrote:
> On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
>> On 2/14/26 05:25, Mohamed Khalfella wrote:
>>> TP8028 Rapid Path Failure Recovery does not define how much time the
>>> host should wait for CCR operation to complete. It is reasonable to
>>> assume that CCR operation can take up to ctrl->cqt. Update wait time for
>>> CCR operation to be max(ctrl->cqt, ctrl->kato).
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella at purestorage.com>
>>> ---
>>>    drivers/nvme/host/core.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 0680d05900c1..ff479c0263ab 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
>>>    	if (result & 0x01) /* Immediate Reset Successful */
>>>    		goto out;
>>>    
>>> -	tmo = secs_to_jiffies(ictrl->kato);
>>> +	tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
>>>    	if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
>>>    		ret = -ETIMEDOUT;
>>>    		goto out;
>>
>> That is not my understanding. I was under the impression that CQT is the
>> _additional_ time a controller requires to clear out outstanding
>> commands once it detected a loss of communication (ie _after_ KATO).
>> Which would mean we have to wait for up to
>> (ctrl->kato * 1000) + ctrl->cqt.
> 
> At this point the source controller knows about communication loss. We
> do not need kato wait. In theory we should just wait for CQT.
> max(cqt, kato) is a conservative guess I made.
> 
Not quite. The source controller (on the host!) knows about the
communication loss. But the target might not, as the last keep-alive
command might have arrived at the target _just_ before KATO
expired on the host. So from the target's point of view the
connection is still good, and it will wait for _another_ KATO
interval before declaring a loss of communication.
Only then will the CQT period start at the target, which gives a
worst case of (ctrl->kato * 1000) + ctrl->cqt.
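
To make the difference concrete, here is a rough userspace sketch (not
kernel code) of the two candidate timeouts. The helper names are made up
for illustration, and it assumes kato is in seconds and cqt is in
milliseconds, as the conversion in the patch suggests.

```c
#include <stdint.h>

/*
 * Illustrative only, not the kernel implementation.
 * Assumption: kato_s is the Keep Alive Timeout in seconds and cqt_ms is
 * the Command Quiesce Time in milliseconds, as implied by the patch.
 * Both helper names are hypothetical.
 */

/* Patch v3 behaviour: wait for whichever of the two is larger. */
static inline uint64_t ccr_wait_max_ms(uint32_t kato_s, uint32_t cqt_ms)
{
	uint64_t kato_ms = (uint64_t)kato_s * 1000;

	return kato_ms > cqt_ms ? kato_ms : cqt_ms;
}

/*
 * Worst case argued in this reply: the target may need one more full
 * KATO interval to notice the loss, and only then does CQT start.
 */
static inline uint64_t ccr_wait_sum_ms(uint32_t kato_s, uint32_t cqt_ms)
{
	return (uint64_t)kato_s * 1000 + cqt_ms;
}
```

With hypothetical values kato = 5 and cqt = 3000, the max() variant
waits 5000 ms while the sum waits 8000 ms, so the host could time out
while the target is still inside its CQT window.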

Randy, please correct me if I'm wrong ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich