[PATCH v5 1/6] nvme: authentication error are always non-retryable

Thu Apr 11 01:37:59 PDT 2024

On 11/04/2024 10:11, Hannes Reinecke wrote:
> On 4/10/24 15:50, Sagi Grimberg wrote:
>>
>>
>> On 10/04/2024 15:05, Hannes Reinecke wrote:
>>> On 4/10/24 12:21, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>>>
>>>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>>>> From: Hannes Reinecke <hare at suse.de>
>>>>>>>
>>>>>>> Any authentication errors which are generated internally are always
>>>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>>>> retried.
>>>>>> The patch title says that any authentication error is not 
>>>>>> retryable, and
>>>>>> the patch body says "authentication errors which are generated 
>>>>>> locally
>>>>>> are non-retryable" so which one is it?
>>>>> Forgot to update the commit message. What about:
>>>>>
>>>>>    All authentication errors are non-retryable, so use negative error
>>>>>    codes to ensure they are not retried.
>>>>>
>>>>> ?
>>>>
>>>> I have a question, what happens if nvmet updated its credentials 
>>>> (by the admin) and in the period until the host got his credentials 
>>>> updated, it
>>>> happens to disconnect/reconnect. It will see an authentication
>>>> error, so it will not retry and remove the controller altogether?
>>>>
>>>> Sounds like an issue to me.
>>>
>>> Usual thing: we cannot differentiate (on the host side) whether the
>>> current PSK is _about_ to be replaced; how should the kernel
>>> know that the admin will replace the PSK in the next minutes?
>>>
>>> But that really is an issue with the standard. Currently there is no
>>> way how a target could inform the initiator that the credentials have
>>> been updated.
>>
>> I'd say that the sane thing for the host to do in this case is to 
>> reconnect
>> until giving up with hope that it may work. This seems like a better 
>> approach
>> than to abruptly remove the controller no?
>>
>>>
>>> We would need to define a new status code for this.
>>> In the meantime the safe operations model is to set a lifetime
>>> for each PSK, and ensure that the PSK is updated on both sides
>>> during the lifetime. With that there is a timeframe during which
>>> both PSKs are available (on the target), and the older will expire
>>> automatically once the lifetime limit is reached.
>>
>> That is a good solution, and will also prevent a loss of service until
>> the host credentials are updated as well.
>>
>> But regardless I have a feeling that simply removing the controller upon
>> an authentication error is not the right thing to do here.
>
> Guess what; that's what I tried to do initially. But then Christoph 
> objected that we shouldn't generate NVMe status codes internally.
> But if we can't do that then we'll have to invent yet another way to 
> return a retryable error, leading to even more band-aid.
> So I am not quite sure how we could achieve that, short of making 
> _every_ error retryable...

So this whole thing is that you want to make the host to not reconnect 
if the controller
sent a DNR and reconnect otherwise?

What are you returning today if the authentication failed? Am I reading 
it right that you are
returning -ECONNREFUSED? I think that for the specific case of 
credentials mismatch (that and only
that) you may want to return -EKEYREJECTED. That according to the 
documentation (/* Key was rejected by service */)
is specific enough that perhaps we can treat it specially when asking 
"should I reconnect?"

Thoughts?