nvmf/rdma host crash during heavy load and keep alive recovery
Steve Wise
swise at opengridcomputing.com
Fri Sep 9 08:50:45 PDT 2016
> > So with this patch, the crash is a little different. One thread is in the
> > usual place, crashed in nvme_rdma_post_send() called by
> > nvme_rdma_queue_rq(), because the qp and cm_id in the nvme_rdma_queue
> > have been freed. Actually there are a handful of CPUs processing different
> > requests in the same type of stack trace. But perhaps that is expected given
> > the workload and number of controllers (10) and queues (32 per
> > controller)...
> >
> > I also see another worker thread here:
> >
> > PID: 3769 TASK: ffff880e18972f40 CPU: 3 COMMAND: "kworker/3:3"
> > #0 [ffff880e2f7938d0] __schedule at ffffffff816dfa17
> > #1 [ffff880e2f793930] schedule at ffffffff816dff00
> > #2 [ffff880e2f793980] schedule_timeout at ffffffff816e2b1b
> > #3 [ffff880e2f793a60] wait_for_completion_timeout at ffffffff816e0f03
> > #4 [ffff880e2f793ad0] destroy_cq at ffffffffa061d8f3 [iw_cxgb4]
> > #5 [ffff880e2f793b60] c4iw_destroy_cq at ffffffffa061dad5 [iw_cxgb4]
> > #6 [ffff880e2f793bf0] ib_free_cq at ffffffffa0360e5a [ib_core]
> > #7 [ffff880e2f793c20] nvme_rdma_destroy_queue_ib at ffffffffa0644e9b
> > [nvme_rdma]
> > #8 [ffff880e2f793c60] nvme_rdma_stop_and_free_queue at ffffffffa0645083
> > [nvme_rdma]
> > #9 [ffff880e2f793c80] nvme_rdma_reconnect_ctrl_work at ffffffffa0645a9f
> > [nvme_rdma]
> > #10 [ffff880e2f793cb0] process_one_work at ffffffff810a1613
> > #11 [ffff880e2f793d90] worker_thread at ffffffff810a22ad
> > #12 [ffff880e2f793ec0] kthread at ffffffff810a6dec
> > #13 [ffff880e2f793f50] ret_from_fork at ffffffff816e3bbf
> >
> > I'm trying to identify whether this reconnect is for the same controller
> > that crashed processing a request. It probably is, but I need to search the
> > stack frame to try to find the controller pointer...
> >
> > Steve.
>
> Both the thread that crashed and the thread doing reconnect are operating on
> ctrl "nvme2"...
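To make the ordering problem concrete, here is a stand-alone userspace analogy
of the race as I understand it. None of this is the actual nvme-rdma code; the
fake_queue/fake_qp names and helpers are made up, and it only mirrors the shape
of the problem: the I/O side dereferences the queue's qp while the
reconnect/teardown side frees it, with nothing stopping and draining the I/O
side first.

/* Userspace analogy only -- not the driver code.  io_thread() stands in
 * for nvme_rdma_queue_rq()/nvme_rdma_post_send() posting on queue->qp;
 * reconnect_thread() stands in for the reconnect work tearing down the
 * queue's IB resources.  Build with: gcc -pthread race_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

struct fake_qp { int id; };

struct fake_queue {
        struct fake_qp *qp;   /* freed by the "reconnect" side */
        bool stopped;         /* analogous to BLK_MQ_S_STOPPED */
};

static void *io_thread(void *arg)
{
        struct fake_queue *q = arg;

        /* Submission side: the !q->stopped check alone does not close the
         * window -- q->qp can still be freed between the check and the use,
         * unless the teardown side stops the queue and waits for us first. */
        if (!q->stopped && q->qp)
                printf("posting on qp %d\n", q->qp->id);
        return NULL;
}

static void *reconnect_thread(void *arg)
{
        struct fake_queue *q = arg;

        /* Teardown side: to be safe it would have to set q->stopped and
         * wait for in-flight io_thread() calls to finish *before* freeing,
         * i.e. stop/quiesce the hw queues first. */
        free(q->qp);
        q->qp = NULL;
        return NULL;
}

int main(void)
{
        struct fake_queue q = { .stopped = false };
        pthread_t a, b;

        q.qp = malloc(sizeof(*q.qp));
        q.qp->id = 42;
        pthread_create(&a, NULL, io_thread, &q);
        pthread_create(&b, NULL, reconnect_thread, &q);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}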
I'm reanalyzing the dump for this particular crash, and I've found the
blk_mq_hw_ctx struct whose ->driver_data points to the nvme_rdma_queue that
caused the crash. hctx->state, though, is 2, which is just the
BLK_MQ_S_TAG_ACTIVE bit. I.e., the BLK_MQ_S_STOPPED bit is _not_ set!
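For reference, a quick stand-alone decode of that state value against the bit
numbers as I read them in the 4.8-era include/linux/blk-mq.h (BLK_MQ_S_STOPPED
is bit 0, BLK_MQ_S_TAG_ACTIVE is bit 1; the kernel tests these with test_bit()
on hctx->state):

/* Decode hctx->state == 2 using the blk-mq state bit numbers
 * (BLK_MQ_S_STOPPED = 0, BLK_MQ_S_TAG_ACTIVE = 1). */
#include <stdio.h>

enum {
        BLK_MQ_S_STOPPED    = 0,
        BLK_MQ_S_TAG_ACTIVE = 1,
};

int main(void)
{
        unsigned long state = 2;  /* value seen in the crash dump */

        printf("BLK_MQ_S_STOPPED    set: %lu\n",
               (state >> BLK_MQ_S_STOPPED) & 1);    /* prints 0 */
        printf("BLK_MQ_S_TAG_ACTIVE set: %lu\n",
               (state >> BLK_MQ_S_TAG_ACTIVE) & 1); /* prints 1 */
        return 0;
}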
Attached are the blk_mq_hw_ctx, nvme_rdma_queue, and nvme_rdma_ctrl structs, as
well as the nvme_rdma_request, request, and request_queue structs if you want
to have a look...
Steve.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crash_analysis.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20160909/6979423a/attachment-0001.txt>