[PATCH] nvme-rdma: set ack timeout of RoCE to 262ms
Sagi Grimberg
sagi at grimberg.me
Sun Aug 28 07:57:43 PDT 2022
>>> On 2022/8/21 14:20, Christoph Hellwig wrote:
>>>> On Fri, Aug 19, 2022 at 03:58:25PM +0800, Chao Leng wrote:
>>>>> Currently the ack timeout of RoCE is 2 seconds (2^(18+1)*4us = 2 seconds).
>>>>> In the case of low concurrency, if some packets are lost due to network
>>>>> anomalies such as network rerouting, optical fiber signal interference,
>>>>> etc., it will wait 2 seconds before retransmitting the lost packets.
>>>>> As a result, the I/O latency is greater than 2 seconds.
>>>>> That latency is too long for real-time transaction services.
>>>>> Indeed, we do not have to wait that long to determine that packets
>>>>> are lost.
>>>>> Setting the ack timeout to 262ms (2^(15+1)*4us = 262ms) is sufficient.
>>>>
>>>> I'll leave people more familiar with RoCE to judge the merits of this
>>>> change, but I really want a comment explaining the choice in the
>>>> source code.
>>> The TCP retransmission timeout interval is currently 250ms, and this
>>> setting has been maintained for many years.
>>> The network quality of RDMA is better than that of common Ethernet.
>>> That is the reason to set 262ms as the default ack timeout.
>>> Adding a module parameter may be a better option.
>>
>> Are you solving a real issue you encountered ?
> There is a low probability that this occurs in real scenarios.
> The issue occurs in a fault simulation test.
> In a core-leaf fabric, we simulate a fiber fault between the core switch
> and the leaf switch.
> In the case of low concurrency, there is a high probability that the
> I/O latency is greater than 2 seconds.
> This patch can reduce the I/O latency to less than 1 second.
>>
>> If so, which devices did you use ?
> The host HBA is Mellanox Technologies MT27800 Family [ConnectX-5];
> the switch and storage are Huawei equipment.
> In principle, switches and storage devices from other vendors
> have the same problem.
> If you think it is necessary, we can test other vendors' switches
> and the Linux target.
Why was the 2s default chosen, and what is the downside of a 250ms
ack timeout? And why is nvme-rdma different from all other kernel RDMA
consumers, such that it needs to set this explicitly?
Adding linux-rdma folks.
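
For reference, the timeout arithmetic quoted above (2^(n+1)*4us) can be
checked with a short sketch. The function name and the 4us unit are
assumptions taken from the figures in this thread, not from the kernel
source; this is illustrative only.

```python
def ack_timeout_us(n: int, unit_us: int = 4) -> int:
    """Ack timeout in microseconds, using the thread's formula 2^(n+1) * unit.

    Hypothetical helper: `n` is the value written as 18 or 15 in the
    thread, and `unit_us` is the 4us base unit the thread uses.
    """
    return (2 ** (n + 1)) * unit_us


# Current value discussed in the thread (n = 18) and the proposed one (n = 15).
for n in (18, 15):
    us = ack_timeout_us(n)
    print(f"n={n}: {us} us = {us / 1000:.3f} ms")
```

Under these assumptions n=18 gives roughly 2.1 seconds and n=15 gives
roughly 262ms, matching the figures in the patch description.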