[PATCH] nvme-rdma: Remove timeout for getting RDMA-CM established event

Mon May 16 15:28:53 PDT 2022

On 5/15/22 08:04, Israel Rukshin wrote:
> When many controllers start error recovery at the same time (i.e.,
> when a port goes down and comes back up), they may never succeed in
> reconnecting.
> This is because the target can't handle all the connect requests within
> three seconds (the arbitrary value set today). Even if some of the
> connections are established, when a single queue fails to connect,
> all the controller's queues are destroyed as well. So, on the
> following reconnection attempts the number of connect requests may
> remain the same. To fix this, remove the timeout and wait for an RDMA-CM
> event to abort or complete the connect request. RDMA-CM sends an
> unreachable event when a timeout of ~90 seconds expires. This approach is
> used by other RDMA-CM users such as SRP and iSER in blocking mode. The
> commit also renames NVME_RDMA_CONNECT_TIMEOUT_MS to NVME_RDMA_CM_TIMEOUT_MS.
> 
> Signed-off-by: Israel Rukshin <israelr at nvidia.com>
> Reviewed-by: Max Gurtovoy <mgurtovoy at nvidia.com>
> ---

Given the complexity of the components involved in this, please write a
blktests test case for the rdma transport to make sure this gets tested
on each release.

-ck