nvmf/rdma host crash during heavy load and keep alive recovery

Thu Sep 8 14:37:28 PDT 2016

> > > Not sure if it fixes anything, but we probably need it regardless, can
> > > you give another go with this on top:
> >
> > Still hit it with this on top (had to tweak the patch a little).
> >
> > Steve.
> 
> So with this patch, the crash is a little different.  One thread crashed in
> the usual place, nvme_rdma_post_send() called by nvme_rdma_queue_rq(),
> because the qp and cm_id in the nvme_rdma_queue have been freed.  Actually,
> a handful of CPUs are processing different requests with the same type of
> stack trace.  But perhaps that is expected given the workload and the
> number of controllers (10) and queues (32 per controller)...
> 
> I also see another worker thread here:
> 
> PID: 3769   TASK: ffff880e18972f40  CPU: 3   COMMAND: "kworker/3:3"
>  #0 [ffff880e2f7938d0] __schedule at ffffffff816dfa17
>  #1 [ffff880e2f793930] schedule at ffffffff816dff00
>  #2 [ffff880e2f793980] schedule_timeout at ffffffff816e2b1b
>  #3 [ffff880e2f793a60] wait_for_completion_timeout at ffffffff816e0f03
>  #4 [ffff880e2f793ad0] destroy_cq at ffffffffa061d8f3 [iw_cxgb4]
>  #5 [ffff880e2f793b60] c4iw_destroy_cq at ffffffffa061dad5 [iw_cxgb4]
>  #6 [ffff880e2f793bf0] ib_free_cq at ffffffffa0360e5a [ib_core]
>  #7 [ffff880e2f793c20] nvme_rdma_destroy_queue_ib at ffffffffa0644e9b [nvme_rdma]
>  #8 [ffff880e2f793c60] nvme_rdma_stop_and_free_queue at ffffffffa0645083 [nvme_rdma]
>  #9 [ffff880e2f793c80] nvme_rdma_reconnect_ctrl_work at ffffffffa0645a9f [nvme_rdma]
> #10 [ffff880e2f793cb0] process_one_work at ffffffff810a1613
> #11 [ffff880e2f793d90] worker_thread at ffffffff810a22ad
> #12 [ffff880e2f793ec0] kthread at ffffffff810a6dec
> #13 [ffff880e2f793f50] ret_from_fork at ffffffff816e3bbf
> 
> I'm trying to identify whether this reconnect is for the same controller
> that crashed processing a request.  It probably is, but I need to search
> the stack frames to try and find the controller pointer...
> 
> Steve.

Both the thread that crashed and the thread doing reconnect are operating on
ctrl "nvme2"...
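
To make the suspected race concrete, here is a minimal user-space sketch in
C.  It is not driver code: fake_queue, fake_qp, fake_queue_rq() and
fake_stop_and_free_queue() are hypothetical stand-ins for the submission
path and the reconnect teardown path, and the "live" flag plus in-flight
counter is only one assumed way to keep a submitter from touching a qp
that the teardown side has already freed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the RDMA qp; not a kernel type. */
struct fake_qp {
	atomic_int sends_posted;
};

/* Hypothetical stand-in for the per-queue state. */
struct fake_queue {
	atomic_bool live;	/* true while qp may be used */
	atomic_int inflight;	/* submitters currently inside the send path */
	struct fake_qp *qp;
};

enum submit_status { SUBMIT_OK = 0, SUBMIT_BUSY = 1 };

int fake_queue_init(struct fake_queue *q)
{
	q->qp = malloc(sizeof(*q->qp));
	if (!q->qp)
		return -1;
	atomic_init(&q->qp->sends_posted, 0);
	atomic_init(&q->inflight, 0);
	atomic_init(&q->live, true);
	return 0;
}

/* Submission side: announce ourselves first, then check the flag. */
enum submit_status fake_queue_rq(struct fake_queue *q)
{
	atomic_fetch_add(&q->inflight, 1);
	if (!atomic_load(&q->live)) {
		/* Teardown has started; back off without touching qp. */
		atomic_fetch_sub(&q->inflight, 1);
		return SUBMIT_BUSY;
	}
	atomic_fetch_add(&q->qp->sends_posted, 1);	/* the "post send" */
	atomic_fetch_sub(&q->inflight, 1);
	return SUBMIT_OK;
}

/* Teardown side: clear the flag, wait for submitters, then free. */
void fake_stop_and_free_queue(struct fake_queue *q)
{
	atomic_store(&q->live, false);
	while (atomic_load(&q->inflight) > 0)
		;	/* busy-wait until in-flight submitters drain */
	free(q->qp);
	q->qp = NULL;
}
```

With sequentially consistent atomics, a submitter either sees live == false
and backs off, or the teardown side sees inflight > 0 and waits before
freeing.  Calling fake_queue_rq() after fake_stop_and_free_queue() returns
SUBMIT_BUSY instead of dereferencing the freed qp.  Whether the driver
should quiesce the queues this way or by some other means is a separate
question.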