nvmf/rdma host crash during heavy load and keep alive recovery

Steve Wise swise at opengridcomputing.com
Thu Sep 8 12:26:00 PDT 2016


> While working this with debug code to verify that we never create a qp,
> cq, or cm_id where one already exists for an nvme_rdma_queue, I discovered
> a bug where the Q_DELETING flag is never cleared, and thus a reconnect can
> leak qps and cm_ids.  The fix, I think, is this:
> 
> @@ -563,6 +572,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>         int ret;
> 
>         queue = &ctrl->queues[idx];
> +       queue->flags = 0;
>         queue->ctrl = ctrl;
>         init_completion(&queue->cm_done);
> 
> I think maybe the clearing of the Q_DELETING flag was lost when we moved
> to using the ib_client for device removal.   I'll polish this up and
> submit a patch. It should go with the next 4.8-rc push I think.

Actually, I think the Q_DELETING flag is no longer needed.  Sagi, can you have a
look at NVME_RDMA_Q_DELETING in the latest code?  I think the ib_client patch
made the original Q_DELETING patch obsolete.  And the original Q_DELETING patch
probably needed the above chunk for correctness...
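
To make the leak concrete, here is a small userspace model of what I think
happens.  It is only a sketch and not the driver code: the struct, the
model_* helpers and the clear_flags switch are all made up for illustration,
and I'm assuming teardown is guarded by the flag so that it runs at most once
per queue.

```c
/* Hypothetical userspace model; none of these names exist in the driver. */
enum { Q_DELETING = 1u << 0 };  /* stands in for NVME_RDMA_Q_DELETING */

struct model_queue {
	unsigned long flags;
	int has_resources;      /* stands in for the queue's qp and cm_id */
};

/* Assumed guard: teardown is skipped if the flag is already set. */
static int model_free_queue(struct model_queue *q)
{
	if (q->flags & Q_DELETING)
		return 0;       /* looks already deleted, nothing freed */
	q->flags |= Q_DELETING;
	q->has_resources = 0;
	return 1;
}

/* clear_flags != 0 models the "queue->flags = 0" line in the chunk above. */
static void model_init_queue(struct model_queue *q, int clear_flags)
{
	if (clear_flags)
		q->flags = 0;
	q->has_resources = 1;   /* new qp and cm_id created for this slot */
}

/* Connect, tear down, reconnect, tear down; nonzero means a leak. */
static int model_leaks_after_reconnect(int clear_flags)
{
	struct model_queue q = { 0, 0 };

	model_init_queue(&q, clear_flags);
	model_free_queue(&q);
	model_init_queue(&q, clear_flags);  /* reconnect reuses the slot */
	model_free_queue(&q);               /* skipped if the flag is stale */
	return q.has_resources;
}
```

With clear_flags == 0 the second teardown is skipped because the stale flag is
still set, so the resources created by the reconnect are never released.  With
clear_flags == 1 both teardowns run and nothing is left behind.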

Let me know if you want me to submit something for this issue.  We could fix the
original patches as they are still only in your nvmf-4.8-rc repo...

Steve.



