[PATCH] nvme-rdma: Remove timeout for getting RDMA-CM established event
Sagi Grimberg
sagi at grimberg.me
Thu Jul 7 00:14:36 PDT 2022
> Hi Sagi
>
> On 5/17/2022 1:11 PM, Sagi Grimberg wrote:
>>
>>> In case many controllers start error recovery at the same time (i.e.,
>>> when a port goes down and comes back up), they may never succeed to
>>> reconnect. This is because the target can't handle all the connect
>>> requests within three seconds (the arbitrary value set today). Even if
>>> some of the connections are established, when a single queue fails to
>>> connect, all the controller's queues are destroyed as well. So, on the
>>> following reconnection attempts the number of connect requests may
>>> remain the same. To fix this, remove the timeout and wait for the
>>> RDMA-CM event to abort/complete the connect request. RDMA-CM sends an
>>> unreachable event when a timeout of ~90 seconds expires. This approach
>>> is used by other RDMA-CM users such as SRP and iSER in blocking mode.
If we are aligning with srp/iser, we can also align with their CM
timeout (1s), given that it does not cover the full connection
establishment.
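
For reference, the change under discussion essentially replaces the
bounded wait for the ESTABLISHED event with an unbounded (but
interruptible) one. A minimal sketch, modeled on nvme_rdma_wait_for_cm()
in drivers/nvme/host/rdma.c; the exact context may differ from the
posted patch:

static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
{
	int ret;

	/*
	 * Before: a bounded wait that gave up after
	 * NVME_RDMA_CONNECT_TIMEOUT_MS (3 seconds), even though RDMA-CM
	 * itself would eventually deliver a REJECTED/UNREACHABLE event:
	 *
	 *   ret = wait_for_completion_interruptible_timeout(&queue->cm_done,
	 *		msecs_to_jiffies(NVME_RDMA_CONNECT_TIMEOUT_MS));
	 *   if (ret == 0)
	 *	return -ETIMEDOUT;
	 *
	 * After: wait until the RDMA-CM event handler completes cm_done.
	 * The wait remains interruptible, so a pending signal (e.g. Ctrl+C
	 * on an "nvme connect") still aborts it immediately.
	 */
	ret = wait_for_completion_interruptible(&queue->cm_done);
	if (ret)
		return ret;
	WARN_ON_ONCE(queue->cm_error > 0);
	return queue->cm_error;
}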
>>
>> So with this, connecting to an unreachable controller will take 90
>> seconds?
> The answer is yes.
> An unreachable controller only occurs when address/route resolution
> passed successfully, so a bad IP will fail immediately.
> When running nvme connect, the user can press Ctrl+C to fail the
> connection immediately. During error recovery it is not possible to
> fail it with Ctrl+C, but it doesn't block other controllers.
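
To make the mechanism explicit: with the timeout gone, the connect
attempt completes or aborts only when the RDMA-CM event handler fires.
A condensed sketch, loosely following nvme_rdma_cm_handler() in
drivers/nvme/host/rdma.c, with most events elided:

static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
		struct rdma_cm_event *ev)
{
	struct nvme_rdma_queue *queue = cm_id->context;
	int cm_error = 0;

	switch (ev->event) {
	case RDMA_CM_EVENT_ESTABLISHED:
		queue->cm_error = nvme_rdma_conn_established(queue);
		/* complete cm_done regardless of success/failure */
		complete(&queue->cm_done);
		return 0;
	case RDMA_CM_EVENT_REJECTED:
		/* e.g. the target came back up and rejected a stale connect */
		cm_error = nvme_rdma_conn_rejected(queue, ev);
		break;
	case RDMA_CM_EVENT_UNREACHABLE:
		/* delivered by RDMA-CM itself after its ~90 second timeout */
	case RDMA_CM_EVENT_CONNECT_ERROR:
		cm_error = -ECONNRESET;
		break;
	/* ... address/route resolution and teardown events elided ... */
	default:
		break;
	}

	if (cm_error) {
		queue->cm_error = cm_error;
		complete(&queue->cm_done);	/* wakes nvme_rdma_wait_for_cm() */
	}
	return 0;
}

Either way, complete(&queue->cm_done) is what wakes the waiter, so the
connect attempt is bounded by RDMA-CM's own timeouts rather than left
hanging.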
> I ran "nvme connect" with and without this patch; the results are in
> the table below:
> Test                                                                      | With patch                                          | Without patch
> --------------------------------------------------------------------------|-----------------------------------------------------|--------------------------------
> Target is busy with other connections/disconnections                      | Succeed                                             | Fail after 3 seconds
> Kill target after successful resolve address and boot after 30 seconds    | Get reject when target machine is up (30 seconds)   | Fail after 3 seconds
> Kill target after successful resolve address and boot after 120 seconds   | Get unreachable event after ~90 seconds             | Fail after 3 seconds
> Port down after successful resolve address and up after 30 seconds        | Succeed                                             | Fail after 3 seconds
> Port down after successful resolve address and up after 120 seconds       | Get unreachable event after ~90 seconds             | Fail after 3 seconds
> One reconnect attempt time during error recovery                          | Up to 90 seconds per controller                     | Up to 3 seconds per controller
> Error recovery with many connections                                      | Succeed                                             | Never
>
> Israel
>
I think it's a bit problematic that a (re)connect may take 90 seconds
just to fail; that will block the reconnect_work thread for 90+ seconds...
But I guess it's fine for now: this has been the behavior of other
transports for years and was never a concern, so I'm fine with it:
Acked-by: Sagi Grimberg <sagi at grimberg.me>