nvmf/rdma host crash during heavy load and keep alive recovery

Sagi Grimberg sagi at grimberg.me
Sun Sep 4 02:17:34 PDT 2016


Hey Steve,

> Ok, back to this issue. :)
>
> The same crash happens with mlx4_ib, so this isn't related to cxgb4.  To sum up:
>
> With pending NVME IO on the nvme-rdma host, and in the presence of kato
> recovery/reconnect due to the target going away, some NVME requests get
> restarted that are referencing nvmf controllers that have freed queues.  I see
> this also with my recent v4 series that corrects the recovery problems with
> nvme-rdma when the target is down, but without pending IO.
>
> So the crash in this email is yet another issue that we see when the nvme host
> has lots of pending IO requests during kato recovery/reconnect...
>
> My findings to date:  the IO is not an admin queue IO.  It is not the kato
> messages.  The io queue has been stopped, yet the request is attempted and
> causes the crash.
>
> Any help is appreciated...

So at this point, my impression is that we are seeing a request being
queued when we shouldn't be (or at least when we assume we won't be).

Given that you run heavy load to reproduce this, I can only suspect that
this is a race condition.

Does this still happen if you change the reconnect delay to something
other than 10 seconds (say, 30)?
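
(For reference, the delay I mean is the one we use to re-arm the
reconnect work.  This is a rough sketch from memory, so the exact
field/option names may differ in your tree:)

        /*
         * Sketch only: nvme-rdma re-arms the reconnect work with the
         * controller's reconnect delay, which is taken from the
         * reconnect_delay= fabrics connect option (default 10s).
         */
        queue_delayed_work(nvme_rdma_wq, &ctrl->reconnect_work,
                           ctrl->reconnect_delay * HZ);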

Can you also give patch [1] a try? It's not a solution, but I want
to see if it hides the problem...

Now, given that you already verified that the queues are stopped with
BLK_MQ_S_STOPPED, I'm looking at blk-mq now.

I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() do indeed
take BLK_MQ_S_STOPPED into account. Theoretically, if we free the queue
pairs after these checks have passed, while the rq_list is still being
processed, we could end up in this condition; but given that the window
here is essentially forever (10 seconds), I tend to doubt that is what
is happening.
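
For reference, the check I'm referring to looks roughly like this
(paraphrased from memory of the blk-mq code, so double-check against
your tree):

        /* paraphrase of blk_mq_run_hw_queue(); not an exact copy */
        void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
        {
                if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
                    !blk_mq_hw_queue_mapped(hctx)))
                        return;
                ...
        }

So the only unprotected window is between this check and the actual
dispatch of the rq_list.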

HCH, Jens, Keith, any useful pointers for us?

To summarize, we see a stray request being queued long after we set
BLK_MQ_S_STOPPED (and by long I mean 10 seconds).
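
If it would help to catch the offender with a stack trace, something
like this at the top of nvme_rdma_queue_rq() might do it (debug-only
sketch, untested):

        static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
                        const struct blk_mq_queue_data *bd)
        {
                /*
                 * Debug sketch: shout if someone queues a request on an
                 * hctx we already stopped, so we can see who did it.
                 */
                WARN_ONCE(test_bit(BLK_MQ_S_STOPPED, &hctx->state),
                          "nvme-rdma: request queued on a stopped hctx\n");
                ...
        }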



[1]:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d2f891efb27b..38ea5dab4524 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -701,20 +701,13 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
         bool changed;
         int ret;

-       if (ctrl->queue_count > 1) {
-               nvme_rdma_free_io_queues(ctrl);
-
-               ret = blk_mq_reinit_tagset(&ctrl->tag_set);
-               if (ret)
-                       goto requeue;
-       }
-
-       nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);

         ret = blk_mq_reinit_tagset(&ctrl->admin_tag_set);
         if (ret)
                 goto requeue;

+       nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
+
         ret = nvme_rdma_init_queue(ctrl, 0, NVMF_AQ_DEPTH);
         if (ret)
                 goto requeue;
@@ -732,6 +725,12 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
         nvme_start_keep_alive(&ctrl->ctrl);

         if (ctrl->queue_count > 1) {
+               ret = blk_mq_reinit_tagset(&ctrl->tag_set);
+               if (ret)
+                       goto stop_admin_q;
+
+               nvme_rdma_free_io_queues(ctrl);
+
                 ret = nvme_rdma_init_io_queues(ctrl);
                 if (ret)
                         goto stop_admin_q;
--


