nvmf/rdma host crash during heavy load and keep alive recovery
Sagi Grimberg
sagi at grimberg.me
Sun Sep 4 02:17:34 PDT 2016
Hey Steve,
> Ok, back to this issue. :)
>
> The same crash happens with mlx4_ib, so this isn't related to cxgb4. To sum up:
>
> With pending NVME IO on the nvme-rdma host, and in the presence of kato
> recovery/reconnect due to the target going away, some NVME requests get
> restarted that are referencing nvmf controllers that have freed queues. I see
> this also with my recent v4 series that corrects the recovery problems with
> nvme-rdma when the target is down, but without pending IO.
>
> So the crash in this email is yet another issue that we see when the nvme host
> has lots of pending IO requests during kato recovery/reconnect...
>
> My findings to date: the IO is not an admin queue IO. It is not the kato
> messages. The io queue has been stopped, yet the request is attempted and
> causes the crash.
>
> Any help is appreciated...
So in the current state, my impression is that we are seeing a request
being queued when we shouldn't be (or at least when we assume we won't be).
Given that you run heavy load to reproduce this, I can only suspect that
this is a race condition.
Does this happen if you change the reconnect delay to something other than
10 seconds (say 30)?
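For reference, the 10 seconds is just the default reconnect_delay from the
connect string; each failed reconnect attempt re-arms the reconnect work
with that delay, roughly like this (paraphrased from memory, not a verbatim
quote of the driver, so take the exact field names with a grain of salt):

	/*
	 * Re-arm the reconnect work with the user-configurable delay
	 * (reconnect_delay=, 10 seconds by default), so bumping it to 30
	 * stretches the whole reconnect cadence.
	 */
	queue_delayed_work(nvme_rdma_wq, &ctrl->reconnect_work,
			ctrl->reconnect_delay * HZ);

If the crash still happens but shifted with the new delay, that would
suggest the stray request is tied to the reconnect flow itself rather than
to some fixed timeout.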
Can you also give patch [1] a try? It's not a solution, but I want
to see if it hides the problem...
Given that you already verified that the queues have BLK_MQ_S_STOPPED set,
I'm now looking at blk-mq itself.
I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() do indeed take
BLK_MQ_S_STOPPED into account. Theoretically, if we free the queue pairs
after these checks have passed, while the rq_list is still being processed,
we could end up in this condition; but given that the stray request shows
up essentially forever (10 seconds) later, I tend to doubt that is what's
happening.
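For context, this is roughly what I'm looking at (paraphrased sketch, not a
verbatim copy of block/blk-mq.c): the stopped bit is tested before the
dispatch starts, but not again while the software queues are drained:

static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);

	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
		return;

	/*
	 * From here on the ctx rq_lists are flushed into rq_list and
	 * handed to ->queue_rq() without re-checking BLK_MQ_S_STOPPED,
	 * which is the theoretical window mentioned above.
	 */
	flush_busy_ctxs(hctx, &rq_list);
	/* ... dispatch rq_list to the driver ... */
}

void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
		return;

	/* async chooses direct call vs. kblockd, elided here */
	__blk_mq_run_hw_queue(hctx);
}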
HCH, Jens, Keith, any useful pointers for us?
To summarize, we see a stray request being queued long after we set
BLK_MQ_S_STOPPED (and by long I mean 10 seconds).
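Steve, to catch the stray request at the point where it enters the driver,
you could also stick something like this at the top of nvme_rdma_queue_rq()
(debug aid only, my suggestion, not something in the driver today):

	/*
	 * Scream and dump a backtrace if ->queue_rq() is invoked on an
	 * hctx that is already marked stopped, so we can see who ran the
	 * hardware queue.
	 */
	WARN_ONCE(test_bit(BLK_MQ_S_STOPPED, &hctx->state),
		"nvme-rdma: ->queue_rq() called on a stopped hctx\n");

The backtrace should tell us whether the request comes in via a direct
queue run, kblockd, or a requeue.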
[1]: the patch below moves the I/O queue teardown and I/O tag set reinit
from the start of the reconnect work to just before the I/O queues are
re-established, once the admin queue is back up:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d2f891efb27b..38ea5dab4524 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -701,20 +701,13 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	bool changed;
 	int ret;
 
-	if (ctrl->queue_count > 1) {
-		nvme_rdma_free_io_queues(ctrl);
-
-		ret = blk_mq_reinit_tagset(&ctrl->tag_set);
-		if (ret)
-			goto requeue;
-	}
-
-	nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
 	ret = blk_mq_reinit_tagset(&ctrl->admin_tag_set);
 	if (ret)
 		goto requeue;
 
+	nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
+
 	ret = nvme_rdma_init_queue(ctrl, 0, NVMF_AQ_DEPTH);
 	if (ret)
 		goto requeue;
@@ -732,6 +725,12 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	nvme_start_keep_alive(&ctrl->ctrl);
 
 	if (ctrl->queue_count > 1) {
+		ret = blk_mq_reinit_tagset(&ctrl->tag_set);
+		if (ret)
+			goto stop_admin_q;
+
+		nvme_rdma_free_io_queues(ctrl);
+
 		ret = nvme_rdma_init_io_queues(ctrl);
 		if (ret)
 			goto stop_admin_q;
--