nvmf/rdma host crash during heavy load and keep alive recovery

Thu Sep 8 13:44:00 PDT 2016

> > While working this with debug code to verify that we never create a qp,
> > cq, or cm_id where one already exists for an nvme_rdma_queue, I discovered
> > a bug where the Q_DELETING flag is never cleared, and thus a reconnect can
> > leak qps and cm_ids.  The fix, I think, is this:
> >
> > @@ -563,6 +572,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
> >         int ret;
> >
> >         queue = &ctrl->queues[idx];
> > +       queue->flags = 0;
> >         queue->ctrl = ctrl;
> >         init_completion(&queue->cm_done);
> >
> > I think maybe the clearing of the Q_DELETING flag was lost when we moved
> > to using the ib_client for device removal.   I'll polish this up and
> > submit a patch. It should go with the next 4.8-rc push I think.
> 
> Actually, I think the Q_DELETING flag is no longer needed.  Sagi, can you
> have a look at NVME_RDMA_Q_DELETING in the latest code?  I think the
> ib_client patch made the original Q_DELETING patch obsolete.  And the
> original Q_DELETING patch probably needed the above chunk for
> correctness...
> 
> Let me know if you want me to submit something for this issue.  We could
> fix the original patches as they are still only in your nvmf-4.8-rc
> repo...

I see that the debug patch v2 you sent me in an earlier email has a
clear_bit() to address the DELETING issue.  I haven't tried that patch
yet. :)  It's next on my list...
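
For reference, here is a minimal userspace sketch of the leak described in
the quoted text.  It is a model only, not the driver code: every model_*
name, the struct layout, and count_leaks() are hypothetical, and it assumes
the teardown path is guarded by a test-and-set on the DELETING bit so that
a queue already marked as deleting is skipped.

```c
#include <stdbool.h>

/* Hypothetical model of one queue; not the real struct nvme_rdma_queue. */
enum { MODEL_Q_DELETING = 0 };

struct model_queue {
	unsigned long flags;
	bool has_qp;		/* true while a pretend qp/cm_id exists */
};

/* Non-atomic stand-in for the kernel's test_and_set_bit(). */
static bool model_test_and_set_bit(int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << nr;
	bool old = (*addr & mask) != 0;

	*addr |= mask;
	return old;
}

/* Returns 1 if a previous qp was still allocated (i.e. it leaks). */
static int model_init_queue(struct model_queue *q, bool clear_flags)
{
	int leaked = q->has_qp ? 1 : 0;

	if (clear_flags)
		q->flags = 0;	/* models the proposed queue->flags = 0 */
	q->has_qp = true;
	return leaked;
}

/* Assumed teardown guard: skip if the queue is already marked deleting. */
static void model_stop_and_free_queue(struct model_queue *q)
{
	if (model_test_and_set_bit(MODEL_Q_DELETING, &q->flags))
		return;
	q->has_qp = false;
}

/* Connect once, then tear down and reconnect 'reconnects' times. */
int count_leaks(bool clear_flags, int reconnects)
{
	struct model_queue q = { 0, false };
	int leaks = model_init_queue(&q, clear_flags);
	int i;

	for (i = 0; i < reconnects; i++) {
		model_stop_and_free_queue(&q);
		leaks += model_init_queue(&q, clear_flags);
	}
	return leaks;
}
```

In this model, count_leaks(false, n) leaks on every reconnect after the
first, because the stale DELETING bit makes the teardown a no-op, while
count_leaks(true, n) never leaks.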

steve