nvmf/rdma host crash during heavy load and keep alive recovery

Steve Wise swise at opengridcomputing.com
Wed Aug 10 08:46:09 PDT 2016


Hey guys, I've rebased the nvmf-4.8-rc branch on top of 4.8-rc2 so I have the
latest/greatest, and continued debugging this crash.  I see:

0) 10 ram disks attached via nvmf/iw_cxgb4, and fio started on all 10 disks.
This node has 8 cores, so that is 80 connections.
1) the cxgb4 interface brought down a few seconds later
2) kato fires on all connections
3) the interface is brought back up 8 seconds after #1
4) 10 seconds after #2 all the qps are destroyed
5) reconnects start happening
6) a blk request is executed and the nvme_rdma_request struct still has a
pointer to one of the qps destroyed in #4 and whamo...
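
To make the hazard in step 6 concrete, here is a minimal userspace sketch of
the pattern.  It is not the nvme-rdma driver code: every name below
(model_qp, model_queue, model_request, model_submit, ...) is hypothetical, and
it only assumes that a request reaches its qp through its queue and that the
queue carries a "connected" flag.  The request re-checks the queue state at
submit time instead of trusting a qp pointer it cached before the teardown.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are NOT the real driver structs. */
struct model_qp {
    int qp_num;
};

struct model_queue {
    struct model_qp *qp;   /* NULL while the queue is torn down */
    bool connected;        /* cleared before the qp goes away */
};

struct model_request {
    struct model_queue *queue;  /* qp is looked up via the queue at submit */
};

/* Teardown (step 4): mark the queue dead first, then drop the qp. */
static void model_queue_teardown(struct model_queue *q)
{
    q->connected = false;
    q->qp = NULL;  /* a real implementation would destroy the qp here */
}

/* Reconnect (step 5): install a fresh qp and mark the queue live again. */
static void model_queue_reconnect(struct model_queue *q,
                                  struct model_qp *new_qp)
{
    q->qp = new_qp;
    q->connected = true;
}

/*
 * Submit (step 6): return the qp number the request would be posted on,
 * or -1 if the queue is not usable and the request has to be requeued
 * or failed by the caller.
 */
static int model_submit(const struct model_request *rq)
{
    const struct model_queue *q = rq->queue;

    if (!q->connected || q->qp == NULL)
        return -1;
    return q->qp->qp_num;
}
```

In this model a request submitted between teardown and reconnect is rejected
rather than posted on a stale qp, and one submitted after the reconnect picks
up the new qp.  Whether the real request cancel logic can be structured this
way is exactly what still needs digging.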

I'm digging into the request cancel logic.  Any ideas/help is greatly
appreciated...

Thanks,

Steve.



