[PATCH 0/2] nvme-fabrics: short-circuit connect retries

Sun Jun 27 06:39:14 PDT 2021

On 6/26/2021 5:09 AM, Hannes Reinecke wrote:
> On 6/26/21 3:03 AM, Chao Leng wrote:
>>
>>
>> On 2021/6/24 16:10, Hannes Reinecke wrote:
>>> On 6/24/21 9:29 AM, Chao Leng wrote:
>>>>
>>>>
>>>> On 2021/6/24 13:51, Hannes Reinecke wrote:
>>>>> On 6/23/21 11:38 PM, Sagi Grimberg wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> commit f25f8ef70ce2 ("nvme-fc: short-circuit reconnect retries")
>>>>>>> allowed the fc transport to honour the DNR bit during reconnect
>>>>>>> retries, allowing to speed up error recovery.
>>>>>>
>>>>>> How does this speed up error recovery?
>>>>>
>>>>> Well, not exactly error recovery (as there is nothing to recover).
>>>>> But we won't attempt pointless retries, thereby reducing the noise in
>>>>> the message log.
>>>> This conflict with the tcp and rdma target.
>>>> You may need to delete the improper NVME_SC_DNR at the target.
>>>> However, this will cause compatibility issues between different 
>>>> versions.
>>>
>>> Which ones?
>> In many scenarios, the destination sets DNR for abnormal packets,
>> but each new connection may not have the same error.
> 
> This patch series is only for the DNR bit set in response to the 
> 'connect' command.
> If the target is not able to process the 'connect' command, but may be 
> so in the future it really should not set the DNR bit.
> 
>>> I checked the DNR usage in the target code, and they seem to set it
>>> correctly (ie the result would not change when the command is retried).
>>> With the possible exception of ENOSPC handling, as this is arguably
>>> dynamic and might change with a retry.
>> The DNR status of the old connection may not be relevant to the 
>> re-established connection.
> 
> See above.
> We are just checking the DNR settings for the 'connect' command (or any 
> other commands being sent during initial controller configuration).
> If that fails the connect never was properly initialized; if the 
> controller would return a different status after reconnect it simply 
> should not set the DNR bit ...
> 
> Cheers,
> 
> Hannes

Agreed. Since 1.3 spec says: "If set to ‘1’, indicates that if the same 
command is re-submitted to any controller in the NVM subsystem, then 
that re-submitted command is expected to fail."

Thus if the initial connect fails in this manner, any new association 
will be on a different controller, where it is now expected connect on 
that controller will fail too.  Thus - why continue to connect when it's 
expected each will fail.

-- james