nvmf/rdma host crash during heavy load and keep alive recovery
Steve Wise
swise at opengridcomputing.com
Thu Sep 8 13:44:00 PDT 2016
> > While working this with debug code to verify that we never create a qp,
> > cq, or cm_id where one already exists for an nvme_rdma_queue, I discovered
> > a bug where the Q_DELETING flag is never cleared, and thus a reconnect can
> > leak qps and cm_ids. The fix, I think, is this:
> >
> > @@ -563,6 +572,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
> >         int ret;
> >
> >         queue = &ctrl->queues[idx];
> > +       queue->flags = 0;
> >         queue->ctrl = ctrl;
> >         init_completion(&queue->cm_done);
> >
> > I think maybe the clearing of the Q_DELETING flag was lost when we moved
> > to using the ib_client for device removal. I'll polish this up and
> > submit a patch. It should go with the next 4.8-rc push I think.
>
> Actually, I think the Q_DELETING flag is no longer needed. Sagi, can you
> have a look at NVME_RDMA_Q_DELETING in the latest code? I think the
> ib_client patch made the original Q_DELETING patch obsolete. And the
> original Q_DELETING patch probably needed the above chunk for
> correctness...
>
> Let me know if you want me to submit something for this issue. We could
> fix the original patches as they are still only in your nvmf-4.8-rc
> repo...
I see that the debug patch v2 you sent me in an earlier email has a clear_bit()
to address the DELETING issue. I haven't tried that patch yet. :) It's next on
my list...
steve
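
For reference, the leak described above can be modeled in user-space C. This
is only a sketch under stated assumptions: it assumes teardown is guarded by a
test-and-set of the DELETING flag, and every name here (struct queue,
init_queue, stop_and_free_queue, run_reconnects, the clear_flags switch) is a
hypothetical stand-in, not the nvme-rdma driver code. It shows why a flag that
is never cleared makes every teardown after the first one a no-op.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for NVME_RDMA_Q_DELETING; not the kernel definition. */
enum { Q_DELETING = 1UL << 0 };

struct queue {
	unsigned long flags;
	int qp;      /* 1 while a (simulated) qp/cm_id is allocated */
	int leaked;  /* resources overwritten without ever being freed */
};

/* Non-atomic model of test_and_set_bit(): set the bit, return its old value. */
static bool test_and_set_flag(unsigned long *flags, unsigned long bit)
{
	bool was_set = (*flags & bit) != 0;

	*flags |= bit;
	return was_set;
}

/* Model of queue init; clear_flags mirrors the proposed "queue->flags = 0". */
static void init_queue(struct queue *q, bool clear_flags)
{
	if (clear_flags)
		q->flags = 0;
	if (q->qp)
		q->leaked++;	/* the previous qp was never freed */
	q->qp = 1;
}

/* Model of teardown: only the first caller that sets DELETING frees. */
static void stop_and_free_queue(struct queue *q)
{
	if (test_and_set_flag(&q->flags, Q_DELETING))
		return;		/* stale flag: teardown is skipped */
	q->qp = 0;
}

/* Run connect/teardown cycles; return how many resources were never freed. */
static int run_reconnects(bool clear_flags, int cycles)
{
	struct queue q = { 0 };

	assert(cycles >= 0);
	for (int i = 0; i < cycles; i++) {
		init_queue(&q, clear_flags);
		stop_and_free_queue(&q);
	}
	return q.leaked + q.qp;
}
```

In this model, run_reconnects(false, 3) returns 2, because only the first
teardown frees anything, while run_reconnects(true, 3) returns 0. A clear_bit()
at the right point in the reconnect path would have the same effect as zeroing
the flags at init.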