[PATCH 1/4] nvme: authentication error are always non-retryable

Hannes Reinecke hare at suse.de
Thu Mar 7 02:32:24 PST 2024


On 3/7/24 09:51, Sagi Grimberg wrote:
> 
> 
> On 01/03/2024 17:26, Hannes Reinecke wrote:
>> On 3/1/24 14:12, Christoph Hellwig wrote:
>>>> -    if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
>>>> -        return AUTHENTICATE;
>>>> -
>>>>       if (blk_noretry_request(req) ||
>>>>           (nvme_req(req)->status & NVME_SC_DNR) ||
>>>>           nvme_req(req)->retries >= nvme_max_retries)
>>>>           return COMPLETE;
>>>>   +    if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
>>>> +        return AUTHENTICATE;
>>>
>>> This part looks fine (although I'd word the commit message
>>> differently for it).
>>>
>>>> @@ -467,7 +467,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl 
>>>> *ctrl)
>>>>           if (result & NVME_CONNECT_AUTHREQ_ASCR) {
>>>>               dev_warn(ctrl->device,
>>>>                    "qid 0: secure concatenation is not supported\n");
>>>> -            ret = NVME_SC_AUTH_REQUIRED;
>>>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>>>               goto out_free_data;
>>>>           }
>>>>           /* Authentication required */
>>>> @@ -475,14 +475,16 @@ int nvmf_connect_admin_queue(struct nvme_ctrl 
>>>> *ctrl)
>>>>           if (ret) {
>>>>               dev_warn(ctrl->device,
>>>>                    "qid 0: authentication setup failed\n");
>>>> -            ret = NVME_SC_AUTH_REQUIRED;
>>>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>>
>>>
>>> .. but the others should never use nvme status codes as they never
>>> go out onto the wire.
>>>
>> Would love to, but to my knowledge we have the convention that NVMe 
>> status codes less than 0 should _not_ be retried.
> 
> What nvme status code is less than zero?
> 
Not the NVMe status, but the 'ret' value containing either a (positive) 
NVMe status or a (negative) errno.

>> NVMe transport errors,
>> OTOH, _should_ be retried.
> Yes.
>> 'connect_admin_queue' and 'connect_io_queue' now straddles the 
>> boundary: technically it's an NVMe command, but in
>> practice it's a transport command as there are quite a lot of steps
>> before we even can send the 'connect' command.
>>
>> So, how do we handle the return codes from 'connect_admin_queue' ?
>> The NVMe core style (with not retrying negative errors) or the NVMe
>> transport style (with always retrying until we run out of retries)?
> 
> Not exactly sure what is the question here. I am still going over emails 
> so there
> may be some discussion around this, but it is unclear from the patch 
> description what is the issue this is solving.

The question is what to do if nvmf_connect_admin_queue() returns with a 
status with the DNR status set.
In TCP the status is carried back to the return value of 
nvme_tcp_setup_ctrl(), which is called during reset, reconnect, and 
initialisation.
For the first two the status is simply ignored, and always retried
(ie every error is treated as retryable). For the latter, we will always 
abort the initialisation (ie every error is non-retryable).
Either way, we're completely ignoring the DNR bit in the NVMe status.

Daniel has sent two patches on my behalf, trying to evaluate the DNR
status during reconnect and reset. With them it suddenly matters how
the return value from nvmf_connect_admin_queue() is to be interpreted.
With my original idea of retrying authentication errors we will keep
on retrying, even though the authentication settings won't change,
and consequently each retry will fail. So in effect we go through
pointless retries, and only abort the test case once the retries
are exhausted. By making authentication errors non-retryable we
can terminate the testcase directly.
The only way I could think of to signal a non-retryable error
is to return an NVMe status with the DNR bit set; every other error
will cause a retry. But this will require us to fabricate an NVMe
status code, which Christoph objects to.

Cheers,

Hannes




More information about the Linux-nvme mailing list