[PATCH 1/4] nvme: authentication error are always non-retryable

Thu Mar 7 03:37:12 PST 2024

On 07/03/2024 12:32, Hannes Reinecke wrote:
> On 3/7/24 09:51, Sagi Grimberg wrote:
>>
>>
>> On 01/03/2024 17:26, Hannes Reinecke wrote:
>>> On 3/1/24 14:12, Christoph Hellwig wrote:
>>>>> -    if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
>>>>> -        return AUTHENTICATE;
>>>>> -
>>>>>       if (blk_noretry_request(req) ||
>>>>>           (nvme_req(req)->status & NVME_SC_DNR) ||
>>>>>           nvme_req(req)->retries >= nvme_max_retries)
>>>>>           return COMPLETE;
>>>>>   +    if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
>>>>> +        return AUTHENTICATE;
>>>>
>>>> This part looks fine (although I'd word the commit message
>>>> differently for it).
>>>>
>>>>> @@ -467,7 +467,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl 
>>>>> *ctrl)
>>>>>           if (result & NVME_CONNECT_AUTHREQ_ASCR) {
>>>>>               dev_warn(ctrl->device,
>>>>>                    "qid 0: secure concatenation is not supported\n");
>>>>> -            ret = NVME_SC_AUTH_REQUIRED;
>>>>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>>>>               goto out_free_data;
>>>>>           }
>>>>>           /* Authentication required */
>>>>> @@ -475,14 +475,16 @@ int nvmf_connect_admin_queue(struct 
>>>>> nvme_ctrl *ctrl)
>>>>>           if (ret) {
>>>>>               dev_warn(ctrl->device,
>>>>>                    "qid 0: authentication setup failed\n");
>>>>> -            ret = NVME_SC_AUTH_REQUIRED;
>>>>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>>>
>>>>
>>>> .. but the others should never use nvme status codes as they never
>>>> go out onto the wire.
>>>>
>>> Would love to, but to my knowledge we have the convention that NVMe 
>>> status codes less than 0 should _not_ be retried.
>>
>> What nvme status code is less than zero?
>>
> Not the NVMe status, but the 'ret' value containing either a 
> (positive) NVMe status or a (negative) errno.
>
>>> NVMe transport errors,
>>> OTOH, _should_ be retried.
>> Yes.
>>> 'connect_admin_queue' and 'connect_io_queue' now straddles the 
>>> boundary: technically it's an NVMe command, but in
>>> practice it's a transport command as there are quite a lot of steps
>>> before we even can send the 'connect' command.
>>>
>>> So, how do we handle the return codes from 'connect_admin_queue' ?
>>> The NVMe core style (with not retrying negative errors) or the NVMe
>>> transport style (with always retrying until we run out of retries)?
>>
>> Not exactly sure what is the question here. I am still going over 
>> emails so there
>> may be some discussion around this, but it is unclear from the patch 
>> description what is the issue this is solving.
>
> The question is what to do if nvmf_connect_admin_queue() returns with 
> a status with the DNR status set.
> In TCP the status is carried back to the return value of 
> nvme_tcp_setup_ctrl(), which is called during reset, reconnect, and 
> initialisation.
> For the first two the status is simply ignored, and always retried
> (ie every error is treated as retryable). For the latter, we will 
> always abort the initialisation (ie every error is non-retryable).
> Either way, we're completely ignoring the DNR bit in the NVMe status.

IMO We should always fail initial connect, for whatever reason. 
userspace can take care
of it if it wants to reconnect. If we managed to ever connect to a 
controller, we will retry
once we lose the connection.

>
> Daniel has sent two patches on my behalf, trying to evaluate the DNR
> status during reconnect and reset.

Didn't get to those yet.

> With them it suddenly matters how
> the return value from nvmf_connect_admin_queue() is to be interpreted.

OK, The spec should probably be clear about it.

> With my original idea of retrying authentication errors we will keep
> on retrying, even though the authentication settings won't change,
> and consequently each retry will fail. So in effect we go through
> pointless retries, and only abort the test case once the retries
> are exhausted. By making authentication errors non-retryable we
> can terminate the testcase directly.
> The only way I could think of to signal a non-retryable error
> is to return an NVMe status with the DNR bit set; every other error
> will cause a retry. But this will require us to fabricate an NVMe
> status code, which Christoph objects to.

Can we start with defining what is a retry-able authentication error and 
what isn't?