[PATCH 0/2] nvme-fabrics: short-circuit connect retries

Wed Jul 7 23:29:44 PDT 2021


On 2021/6/29 19:00, Hannes Reinecke wrote:
> On 6/28/21 3:10 AM, Chao Leng wrote:
>>
>>
>> On 2021/6/27 21:39, James Smart wrote:
>>> On 6/26/2021 5:09 AM, Hannes Reinecke wrote:
>>>> On 6/26/21 3:03 AM, Chao Leng wrote:
>>>>>
>>>>>
>>>>> On 2021/6/24 16:10, Hannes Reinecke wrote:
>>>>>> On 6/24/21 9:29 AM, Chao Leng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021/6/24 13:51, Hannes Reinecke wrote:
>>>>>>>> On 6/23/21 11:38 PM, Sagi Grimberg wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> commit f25f8ef70ce2 ("nvme-fc: short-circuit reconnect retries")
>>>>>>>>>> allowed the fc transport to honour the DNR bit during reconnect
>>>>>>>>>> retries, allowing to speed up error recovery.
>>>>>>>>>
>>>>>>>>> How does this speed up error recovery?
>>>>>>>>
>>>>>>>> Well, not exactly error recovery (as there is nothing to recover).
>>>>>>>> But we won't attempt pointless retries, thereby reducing the noise in
>>>>>>>> the message log.
>>>>>>> This conflict with the tcp and rdma target.
>>>>>>> You may need to delete the improper NVME_SC_DNR at the target.
>>>>>>> However, this will cause compatibility issues between different versions.
>>>>>>
>>>>>> Which ones?
>>>>> In many scenarios, the destination sets DNR for abnormal packets,
>>>>> but each new connection may not have the same error.
>>>>
>>>> This patch series is only for the DNR bit set in response to the 'connect' command.
>>>> If the target is not able to process the 'connect' command, but may be so in the future it really should not set the DNR bit.
>>>>
>>>>>> I checked the DNR usage in the target code, and they seem to set it
>>>>>> correctly (ie the result would not change when the command is retried).
>>>>>> With the possible exception of ENOSPC handling, as this is arguably
>>>>>> dynamic and might change with a retry.
>>>>> The DNR status of the old connection may not be relevant to the re-established connection.
>>>>
>>>> See above.
>>>> We are just checking the DNR settings for the 'connect' command (or any other commands being sent during initial controller configuration).
>>>> If that fails the connect never was properly initialized; if the controller would return a different status after reconnect it simply should not set the DNR bit ...
>>>>
>>>> Cheers,
>>>>
>>>> Hannes
>>>
>>> Agreed. Since 1.3 spec says: "If set to ‘1’, indicates that if the same command is re-submitted to any controller in the NVM subsystem, then that re-submitted command is expected to fail."
>> According to the definition of the protocol, this is not strictly implemented on the target side.
>> In nvme/target, there are many external errors, the DNR is set to 1.
>> For example, abormal fabrics cmd.
> 
> Yes; but an abnormal fabrics cmd will be abnormal even after a reconnect.
Sorry, delayed reply due to leave.
There are many possibilities for abnormal fabrics cmd.
In some scenarios, if the link is re-established, the error may not occur again,
such as HBA inner error, network is abnormal or attacked etc.
> 
>>>
>>> Thus if the initial connect fails in this manner, any new association will be on a different controller, where it is now expected connect on that controller will fail too.  Thus - why continue to connect when it's expected each will fail.
>> Agree.
>> We should not attempt to re-establish the connection if target can not work due to target inner error .
>> However, now  target does not behave exactly like this, so there are conflicts and compatibility issues.
> 
> Note: we never said the target should _not_ send a DNR state. We only said the target should only set the DNR bit if a retry (even after reconnect or on a different controller) will not change the result.
> Which from what I can see _is_ the case, so I do not see any issues here.
> 
> Please give details for the issues you are concerned with.
For example, if host send a fabric cmd, but the fabric cmd is poisoned
due to HBA inner error, abnormal or attacked network, and then the target
set the DNR to reply response. If do not reconnect for DNR response of
old connection, thus the connecting can not aoto recovery. The fabric
cmd poisoning may be transient, may success if try for reconnecting.
so try to reconnecting is a better choice.
If do not reconnect for DNR response, target should set DNR state just
for target inner error.
> 
> Cheers,
> 
> Hannes