[PATCH 0/2] nvme-fabrics: short-circuit connect retries

Chao Leng lengchao at huawei.com
Sun Jun 27 18:10:35 PDT 2021



On 2021/6/27 21:39, James Smart wrote:
> On 6/26/2021 5:09 AM, Hannes Reinecke wrote:
>> On 6/26/21 3:03 AM, Chao Leng wrote:
>>>
>>>
>>> On 2021/6/24 16:10, Hannes Reinecke wrote:
>>>> On 6/24/21 9:29 AM, Chao Leng wrote:
>>>>>
>>>>>
>>>>> On 2021/6/24 13:51, Hannes Reinecke wrote:
>>>>>> On 6/23/21 11:38 PM, Sagi Grimberg wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> commit f25f8ef70ce2 ("nvme-fc: short-circuit reconnect retries")
>>>>>>>> allowed the fc transport to honour the DNR bit during reconnect
>>>>>>>> retries, allowing to speed up error recovery.
>>>>>>>
>>>>>>> How does this speed up error recovery?
>>>>>>
>>>>>> Well, not exactly error recovery (as there is nothing to recover).
>>>>>> But we won't attempt pointless retries, thereby reducing the noise in
>>>>>> the message log.
>>>>> This conflicts with the tcp and rdma target.
>>>>> You may need to delete the improper NVME_SC_DNR at the target.
>>>>> However, this will cause compatibility issues between different versions.
>>>>
>>>> Which ones?
>>> In many scenarios, the destination sets DNR for abnormal packets,
>>> but each new connection may not have the same error.
>>
>> This patch series is only for the DNR bit set in response to the 'connect' command.
>> If the target is not able to process the 'connect' command, but may be so in the future it really should not set the DNR bit.
>>
>>>> I checked the DNR usage in the target code, and they seem to set it
>>>> correctly (ie the result would not change when the command is retried).
>>>> With the possible exception of ENOSPC handling, as this is arguably
>>>> dynamic and might change with a retry.
>>> The DNR status of the old connection may not be relevant to the re-established connection.
>>
>> See above.
>> We are just checking the DNR settings for the 'connect' command (or any other commands being sent during initial controller configuration).
>> If that fails the connect never was properly initialized; if the controller would return a different status after reconnect it simply should not set the DNR bit ...
>>
>> Cheers,
>>
>> Hannes
> 
> Agreed. Since 1.3 spec says: "If set to ‘1’, indicates that if the same command is re-submitted to any controller in the NVM subsystem, then that re-submitted command is expected to fail."
The target side does not strictly implement this definition from the protocol.
In nvme/target, DNR is set to 1 for many errors that are caused externally.
For example, an abnormal fabrics cmd.
> 
> Thus if the initial connect fails in this manner, any new association will be on a different controller, where it is now expected connect on that controller will fail too.  Thus - why continue to connect when it's expected each will fail.
Agreed.
We should not attempt to re-establish the connection if the target cannot work due to a target internal error.
However, the target does not currently behave exactly like this, so there are conflicts and compatibility issues.
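
To make the behaviour under discussion concrete, here is a minimal sketch of the host-side decision. It is not the code from the patch series. It assumes the connect result is an int that is negative for a transport errno, 0 for success, and otherwise a positive NVMe status with DNR at bit 0x4000 (as NVME_SC_DNR is defined in the Linux host). The names sketch_should_retry_connect and SKETCH_SC_DNR are made up for illustration.

```c
#include <stdbool.h>

/* Assumed position of the Do Not Retry bit in the status value
 * (mirrors NVME_SC_DNR in the Linux host; treat as an assumption). */
#define SKETCH_SC_DNR 0x4000

/*
 * Hypothetical helper: decide whether another connect attempt is worthwhile.
 *   status < 0  : transport-level errno, not a verdict from the target
 *   status == 0 : connect succeeded
 *   status > 0  : NVMe status returned by the target for the connect command
 */
static bool sketch_should_retry_connect(int status)
{
	if (status < 0)
		return true;	/* e.g. link failure; a later attempt may succeed */
	if (status == 0)
		return false;	/* connected, nothing to retry */

	/* The target answered: retry only if it did not set DNR. */
	return !(status & SKETCH_SC_DNR);
}
```

With a check like this, a target that sets DNR for an error that might go away on a new connection stops the host from ever reconnecting, which is the compatibility concern above.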
> 
> -- james
> .


