I/O Errors due to keepalive timeouts with NVMf RDMA

Sagi Grimberg sagi at grimberg.me
Mon Jul 10 05:04:52 PDT 2017


>>> [353698.784927] nvme nvme0: creating 44 I/O queues.
>>> [353699.572467] nvme nvme0: new ctrl: NQN
>>> "nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82",
>>> addr 1.1.1.2:4420
>>> [353960.804750] nvme nvme0: SEND for CQE 0xffff88011c0cca58 failed with status
>>> transport retry counter exceeded (12)
>>
>> Exhausted retries, wow... That is really strange...
>>
>> The host sent the keep-alive and it never made it to the target; the HCA
>> retried 7+ times and gave up.
>>
>> Are you running with a switch? Which one? Is the switch experiencing
>> heavy ingress traffic?
> 
> This (unfortunately) was the OmniPath setup, as I was only a guest on the IB
> installation and the other team needed it back. Anyway, I did see this on IB
> as well (on both SLE12-SP3 and v4.12 final). The switch is an Intel Edge
> 100 OmniPath switch.

Note that your keep-alive does not fail after 120 seconds; it is failed
by the HCA after 7 transport retries (roughly 35 seconds).
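
For a rough sense of where those ~35 seconds come from, here is a
back-of-the-envelope sketch (assuming an IB-style local ACK timeout
exponent of 20 and a transport retry count of 7; the actual values are
programmed by the CM/HCA and may differ on OmniPath):

    #include <stdio.h>

    /*
     * Worst-case transport retry window: each attempt waits
     * 4.096us * 2^timeout_exp for an ACK, and the requester gives up
     * after the initial attempt plus retry_cnt retransmissions.
     */
    static double retry_window_sec(unsigned int timeout_exp,
                                   unsigned int retry_cnt)
    {
            double per_attempt = 4.096e-6 * (double)(1ULL << timeout_exp);

            return per_attempt * (retry_cnt + 1);
    }

    int main(void)
    {
            /* assumed: timeout exponent 20 (~4.3s per attempt), 7 retries */
            printf("~%.1f seconds\n", retry_window_sec(20, 7));
            return 0;
    }

With those assumed values the window comes out to about 34 seconds,
which matches the observed failure time.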

And if your keep-alive did not make it through in 35 seconds, then it's an
indication that something is wrong... which is exactly what keep-alives
are designed to detect... So I'm not at all sure that we need to compensate
for this in the driver at all; something is clearly wrong in your
fabric.
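
For reference, the status in the quoted log maps to IB_WC_RETRY_EXC_ERR
(value 12 in enum ib_wc_status), and the "transport retry counter
exceeded" text is the ib_wc_status_msg() string for it. Below is a
minimal sketch of where a ULP sees such an error; it is illustrative,
not the actual nvme-rdma handler:

    #include <linux/printk.h>
    #include <rdma/ib_verbs.h>

    /*
     * Sketch of a ULP send-completion callback: a keep-alive SEND that
     * the HCA gave up on after exhausting its transport retries
     * completes with IB_WC_RETRY_EXC_ERR (12).
     */
    static void example_send_done(struct ib_cq *cq, struct ib_wc *wc)
    {
            if (wc->status != IB_WC_SUCCESS) {
                    pr_err("SEND for CQE %p failed with status %s (%d)\n",
                           wc->wr_cqe, ib_wc_status_msg(wc->status),
                           wc->status);
                    /* the ULP would typically start error recovery here */
            }
    }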


