[PATCH] nvme-rdma: Remove timeout for getting RDMA-CM established event

Israel Rukshin israelr at nvidia.com
Wed May 18 05:14:38 PDT 2022


Hi Sagi,

On 5/17/2022 1:11 PM, Sagi Grimberg wrote:
>
>> In case many controllers start error recovery at the same time (i.e.,
>> when port is down and up), they may never succeed to reconnect again.
>> This is because the target can't handle all the connect requests at
>> three seconds (the arbitrary value set today). Even if some of the
>> connections are established, when a single queue fails to connect,
>> all the controller's queues are destroyed as well. So, on the
>> following reconnection attempts the number of connect requests may
>> remain the same. To fix this, remove the timeout and wait for RDMA-CM
>> event to abort/complete the connect request. RDMA-CM sends unreachable
>> event when a timeout of ~90 seconds is expired. This approach is used
>> at other RDMA-CM users like SRP and iSER at blocking mode.
>
> So with this connecting to an unreachable controller will take 90
> seconds?

Yes.
A controller is only reported as unreachable after address/route
resolution has completed successfully, so a bad IP address still fails
immediately.
When running "nvme connect", the user can press Ctrl+C to abort the
connection attempt immediately. During error recovery the attempt
cannot be aborted with Ctrl+C, but it doesn't block others.
I ran "nvme connect" with and without this patch; the results are in
the table below:

Test                                                                    | With patch                              | Without patch
------------------------------------------------------------------------+-----------------------------------------+-------------------------------
Target is busy with other connections/disconnections                    | Succeed                                 | Fail after 3 seconds
Kill target after successful resolve address and boot after 30 seconds  | Get reject when target machine is up    | Fail after 3 seconds
Kill target after successful resolve address and boot after 120 seconds | Get unreachable event after ~90 seconds | Fail after 3 seconds
Port down after successful resolve address and up after 30 seconds      | Succeed                                 | Fail after 3 seconds
Port down after successful resolve address and up after 120 seconds     | Get unreachable event after ~90 seconds | Fail after 3 seconds
At error recovery, one reconnect attempt time is                        | Up to 90 seconds per controller         | Up to 3 seconds per controller
Error recovery with many connections                                    | Succeed                                 | Never
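
For readers following the thread, the difference between the two waiting strategies can be sketched in userspace. This is only an analogy under stated assumptions, not the driver code: `connect`, `SCALE`, `OLD_TIMEOUT_S` and `CM_UNREACHABLE_S` are hypothetical names, a `threading.Timer` stands in for RDMA-CM delivering an event, and time is compressed so that one simulated second takes 10 ms. It models only the "port down, then up" rows of the table.

```python
import threading

SCALE = 0.01            # assumption: 1 simulated second = 10 ms of real time
OLD_TIMEOUT_S = 3       # initiator-side wait that the patch removes
CM_UNREACHABLE_S = 90   # approximate RDMA-CM give-up time quoted in this thread


def connect(target_ready_after_s, wait_timeout_s=None):
    """Simulate one connect attempt and return its outcome as a string.

    wait_timeout_s=None models the patched behaviour: block until the
    (simulated) connection manager reports an event of its own.
    """
    cm_done = threading.Event()
    outcome = {}

    # Stand-in for RDMA-CM: report "established" once the target answers,
    # or "unreachable" if the connection manager's own timeout expires first.
    if target_ready_after_s <= CM_UNREACHABLE_S:
        delay_s, status = target_ready_after_s, "established"
    else:
        delay_s, status = CM_UNREACHABLE_S, "unreachable"

    def deliver_event():
        outcome["status"] = status
        cm_done.set()

    timer = threading.Timer(delay_s * SCALE, deliver_event)
    timer.start()

    timeout = None if wait_timeout_s is None else wait_timeout_s * SCALE
    signalled = cm_done.wait(timeout)

    # Stop the simulated connection manager if we gave up before it fired.
    timer.cancel()
    timer.join()
    return outcome["status"] if signalled else "timeout"
```

Under these assumptions, `connect(30, OLD_TIMEOUT_S)` gives up with `"timeout"`, while `connect(30)` waits and returns `"established"` and `connect(120)` returns `"unreachable"` once the simulated 90 seconds have passed, which mirrors the 30-second and 120-second port-down rows above.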

Israel



