[PATCH] nvme-rdma: Remove timeout for getting RDMA-CM established event
Israel Rukshin
israelr at nvidia.com
Wed May 18 05:14:38 PDT 2022
Hi Sagi,
On 5/17/2022 1:11 PM, Sagi Grimberg wrote:
>
>> In case many controllers start error recovery at the same time (i.e.,
>> when port is down and up), they may never succeed to reconnect again.
>> This is because the target can't handle all the connect requests at
>> three seconds (the arbitrary value set today). Even if some of the
>> connections are established, when a single queue fails to connect,
>> all the controller's queues are destroyed as well. So, on the
>> following reconnection attempts the number of connect requests may
>> remain the same. To fix this, remove the timeout and wait for RDMA-CM
>> event to abort/complete the connect request. RDMA-CM sends unreachable
>> event when a timeout of ~90 seconds is expired. This approach is used
>> at other RDMA-CM users like SRP and iSER at blocking mode.
>
> So with this connecting to an unreachable controller will take 90
> seconds?
Yes.
A controller is only reported as unreachable after address/route
resolution has passed successfully, so a bad IP address will still fail
immediately.
When running "nvme connect", the user can press Ctrl+C to fail the
connection immediately. During error recovery it is not possible to
fail it by pressing Ctrl+C, but it doesn't block others.
I ran "nvme connect" with and without this patch; the results are in
the table below:
Test | With patch | Without patch
-----+------------+--------------
Target is busy with other connections/disconnections | Succeed | Fail after 3 seconds
Kill target after successful resolve address and boot after 30 seconds | Get reject when target machine is up | Fail after 3 seconds
Kill target after successful resolve address and boot after 120 seconds | Get unreachable event after ~90 seconds | Fail after 3 seconds
Port down after successful resolve address and up after 30 seconds | Succeed | Fail after 3 seconds
Port down after successful resolve address and up after 120 seconds | Get unreachable event after ~90 seconds | Fail after 3 seconds
At error recovery, one reconnect attempt time is | Up to 90 seconds per controller | Up to 3 seconds per controller
Error recovery with many connections | Succeed | Never
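
The last row can be illustrated with a rough toy model in Python. This
is only a sketch, not driver code: it assumes the target serves connect
requests one at a time, that every controller keeps one connect request
outstanding and connects its queues in turn, and that all controllers
retry in lockstep. It also ignores the delay between reconnect attempts
and assumes no request waits long enough to hit the ~90 second RDMA-CM
limit. The function name, the 0.05 second service time, and the
controller/queue counts are made up for illustration.

```python
def simulate_recovery(controllers, queues_per_ctrl, service_time_s,
                      connect_timeout_s=None, max_attempts=10):
    """Toy model of many controllers reconnecting to one target.

    Returns (reconnected, attempts_used, seconds_elapsed).
    """
    elapsed = 0.0
    for attempt in range(1, max_attempts + 1):
        # One outstanding connect request per controller, served serially
        # by the target: each request completes after roughly
        # controllers * service_time_s seconds.
        wait = controllers * service_time_s
        if connect_timeout_s is not None and wait > connect_timeout_s:
            # Every controller times out on a queue, destroys all of its
            # queues, and retries. The load on the next attempt is the
            # same, so nothing improves.
            elapsed += connect_timeout_s
            continue
        # No timeout was hit: every queue of every controller connects.
        elapsed += queues_per_ctrl * wait
        return True, attempt, elapsed
    return False, max_attempts, elapsed


if __name__ == "__main__":
    # Hypothetical load: 100 controllers, 8 queues each, 50 ms per request.
    print("3 s timeout:", simulate_recovery(100, 8, 0.05, connect_timeout_s=3.0))
    print("no timeout: ", simulate_recovery(100, 8, 0.05))
```

In this model the fixed timeout fails on every attempt as soon as
controllers * service_time_s exceeds it, because a failed attempt never
reduces the number of requests in the next one. Without the timeout the
same load completes on the first attempt, only more slowly.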
Israel