nvmf/rdma host crash during heavy load and keep alive recovery
Steve Wise
swise at opengridcomputing.com
Thu Sep 8 14:37:28 PDT 2016
> > > Not sure if it fixes anything, but we probably need it regardless, can
> > > you give another go with this on top:
> >
> > Still hit it with this on top (had to tweak the patch a little).
> >
> > Steve.
>
> So with this patch, the crash is a little different. One thread is in the
> usual place crashed in nvme_rdma_post_send() called by
> nvme_rdma_queue_rq() because the qp and cm_id in the nvme_rdma_queue have
> been freed. Actually there are a handful of CPUs processing different
> requests with the same type of stack trace. But perhaps that is expected
> given the workload and the number of controllers (10) and queues (32 per
> controller)...
>
> I also see another worker thread here:
>
> PID: 3769 TASK: ffff880e18972f40 CPU: 3 COMMAND: "kworker/3:3"
> #0 [ffff880e2f7938d0] __schedule at ffffffff816dfa17
> #1 [ffff880e2f793930] schedule at ffffffff816dff00
> #2 [ffff880e2f793980] schedule_timeout at ffffffff816e2b1b
> #3 [ffff880e2f793a60] wait_for_completion_timeout at ffffffff816e0f03
> #4 [ffff880e2f793ad0] destroy_cq at ffffffffa061d8f3 [iw_cxgb4]
> #5 [ffff880e2f793b60] c4iw_destroy_cq at ffffffffa061dad5 [iw_cxgb4]
> #6 [ffff880e2f793bf0] ib_free_cq at ffffffffa0360e5a [ib_core]
> #7 [ffff880e2f793c20] nvme_rdma_destroy_queue_ib at ffffffffa0644e9b [nvme_rdma]
> #8 [ffff880e2f793c60] nvme_rdma_stop_and_free_queue at ffffffffa0645083 [nvme_rdma]
> #9 [ffff880e2f793c80] nvme_rdma_reconnect_ctrl_work at ffffffffa0645a9f [nvme_rdma]
> #10 [ffff880e2f793cb0] process_one_work at ffffffff810a1613
> #11 [ffff880e2f793d90] worker_thread at ffffffff810a22ad
> #12 [ffff880e2f793ec0] kthread at ffffffff810a6dec
> #13 [ffff880e2f793f50] ret_from_fork at ffffffff816e3bbf
>
> I'm trying to identify if this reconnect is for the same controller that
> crashed processing a request. It probably is, but I need to search the
> stack frame to try and find the controller pointer...
>
> Steve.
Both the thread that crashed and the thread doing reconnect are operating on
ctrl "nvme2"...
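
To make the suspected interaction concrete, here is a small userspace model
of the two paths described above: a submission path that touches the qp and
a teardown path that frees it. This is only a sketch, not nvme_rdma code.
Every name in it (model_queue, model_qp, model_submit, model_teardown, the
"live" flag) is hypothetical and chosen for illustration. It shows one common
guard, a liveness flag checked before the qp is dereferenced. A flag check on
its own only narrows the window: a submitter that has already passed the
check can still race with the free, so a real fix would also have to wait for
in-flight submitters to drain before freeing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from nvme_rdma. */
struct model_qp {
    int sends_posted;
};

struct model_queue {
    atomic_bool live;     /* cleared before teardown frees the qp */
    struct model_qp *qp;  /* owned by the queue, freed in teardown */
};

/* Allocate the qp and mark the queue usable. Returns 0 on success. */
static int model_queue_init(struct model_queue *q)
{
    q->qp = calloc(1, sizeof(*q->qp));
    if (!q->qp)
        return -1;
    atomic_init(&q->live, true);
    return 0;
}

/*
 * Submission path: refuse to touch q->qp once the queue is marked dead.
 * Returns 0 if the send was "posted", -1 if the queue is no longer live
 * (a real caller would requeue or fail the request at this point).
 */
static int model_submit(struct model_queue *q)
{
    if (!atomic_load(&q->live))
        return -1;
    assert(q->qp != NULL);
    q->qp->sends_posted++;
    return 0;
}

/*
 * Teardown path: mark the queue dead first, then free the qp.
 * NOTE: this ordering alone does not stop a submitter that already
 * passed the live check; draining in-flight submitters is not modeled.
 */
static void model_teardown(struct model_queue *q)
{
    atomic_store(&q->live, false);
    free(q->qp);
    q->qp = NULL;
}
```

In this model, a model_submit() call made after model_teardown() returns -1
instead of dereferencing the freed qp, which is the behaviour one would want
from the submission path while a reconnect is tearing the queue down.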