I/O Errors due to keepalive timeouts with NVMf RDMA

Johannes Thumshirn jthumshirn at suse.de
Mon Jul 10 04:50:03 PDT 2017


On Mon, Jul 10, 2017 at 02:41:28PM +0300, Sagi Grimberg wrote:
> >Host:
> >[353698.784927] nvme nvme0: creating 44 I/O queues.
> >[353699.572467] nvme nvme0: new ctrl: NQN
> >"nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82",
> >addr 1.1.1.2:4420
> >[353960.804750] nvme nvme0: SEND for CQE 0xffff88011c0cca58 failed with status
> >transport retry counter exceeded (12)
> 
> Exhausted retries, wow... That is really strange...
> 
> The host sent the keep-alive and it never made it to the target; the HCA
> retried 7+ times and gave up.
> 
> Are you running with a switch? Which one? Is the switch experiencing
> higher ingress?

This (unfortunately) was the OmniPath setup, as I was only a guest on the IB
installation and the other team needed it back. Anyway, I did see this on IB
as well (with both SLE12-SP3 and v4.12 final). The switch is an Intel Edge
100 OmniPath switch.
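
For reference, this is roughly what the retry-exceeded condition looks like
at the verbs level. A minimal userspace sketch, not the nvme-rdma driver
code itself; it assumes a CQ that was created elsewhere:

/*
 * Drain send completions and report the "transport retry counter
 * exceeded" status that the host log above shows as status 12.
 * Sketch only; QP/CQ setup is assumed to exist elsewhere.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_send_completions(struct ibv_cq *cq)
{
	struct ibv_wc wc;

	while (ibv_poll_cq(cq, 1, &wc) > 0) {
		if (wc.status == IBV_WC_RETRY_EXC_ERR) {
			/* The HCA retransmitted the SEND up to its retry
			 * count and never saw an ACK from the peer. */
			fprintf(stderr, "SEND wr_id %llu failed: %s\n",
				(unsigned long long)wc.wr_id,
				ibv_wc_status_str(wc.status));
		}
	}
}

By the time this status surfaces, the HCA has already retransmitted the
SEND up to its retry count without seeing an ACK, so the keep-alive was
not merely delayed once.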

[...]

> 
> And why aren't you able to reconnect?
> 
> Something smells mis-configured here...

I am able to reconnect, it just takes ages:
[354235.064586] nvme nvme0: Failed reconnect attempt 27
[354235.076054] nvme nvme0: Reconnecting in 10 seconds...
[354245.117100] nvme nvme0: rdma_resolve_addr wait failed (-104).
[354245.144574] nvme nvme0: Failed reconnect attempt 28
[354245.156097] nvme nvme0: Reconnecting in 10 seconds...
[354255.244008] nvme nvme0: creating 44 I/O queues.
[354255.877529] nvme nvme0: Successfully reconnected
[354255.900579] nvme0n1: detected capacity change from -67526893324191744 to 68719476736
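
For reference, the 10 second cadence in the log comes from the fabrics
reconnect delay; the host keeps retrying until the controller-loss timeout
expires. Below is a rough standalone model of that bounded retry loop, with
illustrative values only (the real reconnect_delay and ctrl_loss_tmo come
from the connect options, not from these defines):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define RECONNECT_DELAY	10	/* seconds between attempts (seen in the log) */
#define CTRL_LOSS_TMO	600	/* assumed controller-loss timeout in seconds */

/* Hypothetical stand-in for the transport-level reconnect attempt. */
static bool try_reconnect(void)
{
	return false;
}

int main(void)
{
	/* Round up so the full controller-loss window is covered. */
	int max_reconnects = (CTRL_LOSS_TMO + RECONNECT_DELAY - 1) / RECONNECT_DELAY;

	for (int attempt = 1; attempt <= max_reconnects; attempt++) {
		if (try_reconnect()) {
			printf("Successfully reconnected\n");
			return 0;
		}
		printf("Failed reconnect attempt %d, reconnecting in %d seconds\n",
		       attempt, RECONNECT_DELAY);
		sleep(RECONNECT_DELAY);
	}
	printf("giving up after %d attempts\n", max_reconnects);
	return 1;
}

With those numbers the host would keep trying for up to ten minutes, which
matches the "takes ages" behaviour above when the fabric only recovers late.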

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
