nvmf/rdma host crash during heavy load and keep alive recovery
Steve Wise
swise at opengridcomputing.com
Thu Sep 15 08:10:57 PDT 2016
> The state of the controller is NVME_CTRL_RECONNECTING. In fact, this BUG_ON()
> happened on the reconnect worker thread. Ah, so this is probably the connect
> command on the admin queue?
>
The queue being used at the crash is nvme_rdma_ctrl->queues[1], i.e. not the
admin queue (that is queues[0]). The reconnect worker thread is connecting the
io queues here:
crash> gdb list *nvme_rdma_reconnect_ctrl_work+0x14b
0xffffffffa065cafb is in nvme_rdma_reconnect_ctrl_work
(drivers/nvme/host/rdma.c:647).
642     {
643             int i, ret = 0;
644
645             for (i = 1; i < ctrl->queue_count; i++) {
646                     ret = nvmf_connect_io_queue(&ctrl->ctrl, i);
647                     if (ret)
648                             break;
649             }
650
651             return ret;
nvmf_connect_io_queue(), which submits the fabrics Connect command for the
given qid, is here:
crash> gdb list *nvmf_connect_io_queue+0x114
0xffffffffa064d134 is in nvmf_connect_io_queue
(drivers/nvme/host/fabrics.c:454).
449             strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);
450
451             ret = __nvme_submit_sync_cmd(ctrl->connect_q, &cmd, &cqe,
452                             data, sizeof(*data), 0, qid, 1,
453                             BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
454             if (ret) {
455                     nvmf_log_connect_error(ctrl, ret, le32_to_cpu(cqe.result),
456                                     &cmd, data);
457             }
458             kfree(data);
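
For reference, the hctx-to-queue association is established once, at hctx init
time. In this era of the driver, nvme_rdma_init_hctx() looks roughly like the
following (paraphrased from drivers/nvme/host/rdma.c):

    static int nvme_rdma_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                    unsigned int hctx_idx)
    {
            struct nvme_rdma_ctrl *ctrl = data;
            /* io hctx N maps to queues[N + 1]; queues[0] is the admin queue */
            struct nvme_rdma_queue *queue = &ctrl->queues[hctx_idx + 1];

            BUG_ON(hctx_idx >= ctrl->queue_count);

            hctx->driver_data = queue;
            return 0;
    }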
The hctx passed into nvme_rdma_queue_rq() has BLK_MQ_S_TAG_ACTIVE set in its
state, and hctx->driver_data is the nvme_rdma_queue to be used. That
nvme_rdma_queue holds a different hctx pointer (recorded by my debug code),
which is why we hit the BUG_ON(). Meanwhile, nvme_rdma_queue->hctx->state has
BLK_MQ_S_STOPPED set. So this is more evidence that somehow an hctx is using an
nvme_rdma_queue that wasn't originally assigned to that hctx...
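
The debug code amounts to recording a back-pointer at init time and checking it
in the dispatch path; a minimal sketch (the queue->hctx field is my debug
addition, not an upstream field):

    /* in nvme_rdma_init_hctx(), after hctx->driver_data = queue: */
    queue->hctx = hctx;                     /* debug-only back-pointer */

    /* at the top of nvme_rdma_queue_rq(): */
    struct nvme_rdma_queue *queue = hctx->driver_data;
    BUG_ON(queue->hctx != hctx);            /* fires on the mismatch above */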