[PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
Steve Wise
swise at opengridcomputing.com
Mon Jul 18 11:04:55 PDT 2016
> > > > > Hey Sagi, here is some lite reading for you. :)
> > > > >
> > > > > Prelude: As part of disconnecting an iwarp connection, the iwarp
> > > > > provider needs to post an IW_CM_EVENT_CLOSE event to iw_cm, which
> > > > > is scheduled onto the singlethread workq thread for iw_cm.
> > > > >
> > > > > Here is what happens with Sagi's patch:
> > > > >
> > > > > nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls
> > > > > rdma_disconnect(). This triggers the disconnect. iw_cxgb4 posts the
> > > > > IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler()
> > > > > in the iw_cm workq thread context. cm_close_handler() calls the
> > > > > rdma_cm event handler for this cm_id, function cm_iw_handler(),
> > > > > which blocks until any currently running event handler for this
> > > > > cm_id finishes. It does this by calling cm_disable_callback().
> > > > > However, since this whole unplug process is running in the event
> > > > > handler function for this same cm_id, the iw_cm workq thread is now
> > > > > stuck in a deadlock. nvme_rdma_device_unplug(), however, continues
> > > > > on and schedules the controller delete worker thread and waits for
> > > > > it to complete. The delete controller worker thread tries to
> > > > > disconnect and destroy all the remaining IO queues, but gets stuck
> > > > > in the destroy() path on the first IO queue because the iw_cm workq
> > > > > thread is already stuck, and processing the CLOSE event is required
> > > > > to release a reference the iw_cm has on the iwarp provider's qp. So
> > > > > everything comes to a grinding halt....
> > > > >
> > > > > Now: Ming's 2 patches avoid this deadlock because the cm_id that
> > > > > received the device removal event is disconnected/destroyed _only
> > > > > after_ all the controller queues are disconnected/destroyed. So
> > > > > nvme_rdma_device_unplug() doesn't get stuck waiting for the
> > > > > controller to delete the io queues, and only after that completes
> > > > > does it delete the cm_id/qp that got the device removal event. It
> > > > > then returns, thus causing the rdma_cm to release the cm_id's
> > > > > callback mutex. This causes the iw_cm workq thread to now unblock
> > > > > and we continue on. (can you say house of cards?)
> > > > >
> > > > > So the net is: the cm_id that received the device remove event
> > > > > _must_ be disconnected/destroyed _last_.
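
To make the ordering concrete, my reading of what the removal handler ends up
doing with Ming's patches applied is roughly the sketch below; the helper and
field names are approximations of the driver code, not the literal patch:

static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
{
	struct nvme_rdma_ctrl *ctrl = queue->ctrl;

	/*
	 * Tear down all the other queues first.  None of their cm_ids is
	 * the one running this handler, so the iw_cm workq can still make
	 * progress on their CLOSE events.
	 */
	queue_work(nvme_rdma_wq, &ctrl->delete_work);
	flush_work(&ctrl->delete_work);

	/*
	 * Only now touch the queue whose cm_id received DEVICE_REMOVAL.
	 * Returning nonzero tells the rdma_cm to destroy this cm_id once
	 * the handler returns, so it really is torn down last.
	 */
	nvme_rdma_stop_queue(queue);
	nvme_rdma_destroy_queue_ib(queue);
	return 1;
}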
> > > >
> > > > Hey Steve, thanks for the detailed description.
> > > >
> > > > IMHO, this really looks like a buggy design in iWARP connection
> > > > management implementation. The fact that a rdma_disconnect forward
> > > > progress is dependent on other cm_id execution looks wrong and
> > > > backwards to me.
> > > >
> > >
> > > Yea. I'm not sure how to fix it at this point...
> > >
> > > > Given that the device is being removed altogether, I don't think that
> > > > calling rdma_disconnect is important at all. Given the fact that
> > > > ib_drain_qp makes sure that the queue-pair is in error state, does
> > > > this incremental patch resolve the iwarp deadlock you are seeing?
> > > >
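
(The incremental patch itself isn't quoted here; as described, the idea is
roughly to skip the disconnect in the removal path and let the drain move the
QP to error, something like the fragment below -- queue->cm_id->qp stands in
for however the driver actually reaches the QP:)

	/* on DEVICE_REMOVAL: don't rdma_disconnect(), just flush the QP */
	ib_drain_qp(queue->cm_id->qp);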
> > >
> > > I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR
> > > explicitly. When I implemented them I assumed the QP would be
> > > disconnected. Let me try this and report back.
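
(What I had in mind was roughly the following; the real change would go
through iw_cxgb4's own modify-QP path rather than ib_modify_qp(), so treat
this as a sketch only:)

static void c4iw_drain_sq(struct ib_qp *ibqp)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };

	/*
	 * Move the QP to ERROR ourselves instead of assuming the
	 * connection teardown already did it, then wait for the SQ to
	 * flush as the existing code does.
	 */
	ib_modify_qp(ibqp, &attr, IB_QP_STATE);

	/* ... existing wait-for-flush logic unchanged ... */
}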
> >
> > Just doing the drain has the same issue, because moving the QP to ERROR
> > goes through the same process as a disconnect wrt the provider/IWCM
> > interactions.
> >
> > The reason this close provider/iwcm interaction is needed is the IWARP
> > relationship between QPs and the TCP connection (and thus the cm_id).
> > The iwarp driver stack manipulates the QP state based on the connection
> > events directly. For example, the transition from IDLE -> RTS is done
> > _only_ by the iwarp driver stack. Similarly, if the peer closes the TCP
> > connection (or aborts), then the local iwarp driver stack will move the
> > QP out of RTS into CLOSING or ERROR. I think this is different from IB
> > and the IBCM. This, combined with the IWCM maintaining a reference on
> > the QP until the iwarp driver posts a CLOSED event for its iw_cm_id, is
> > what causes the problem here. I'm not justifying the existing logic,
> > just trying to explain.
> >
> > Perhaps we can explore changing the iw_cm and the iWARP parts of the
> > rdma_cm to fix this problem... I guess the main problem is that qp
> > disconnect and destroy basically cannot be done in the event handler.
> > Granted, it works if you only destroy the qp associated with the cm_id
> > handling the event, but that still causes a deadlock; it's just that
> > the event handler can return and thus unblock the deadlock.
>
> I think I can fix this in iw_cxgb4. Stay tuned.
Nope. Without refactoring the IWCM (somehow), I don't know how to fix this.
I think we should go ahead with Ming's patches, perhaps add a FIXME comment,
and I can look into fixing this in the IWCM. Unless you have another approach
that only disconnects/flushes/destroys the cm_id/qp that received the device
removal event at the end of the handler upcall...
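
For the FIXME, I was thinking of something along these lines in the removal
path (the wording is just a suggestion):

	/*
	 * FIXME: with iWARP, the cm_id that received DEVICE_REMOVAL must be
	 * disconnected/destroyed only after every other queue on this device
	 * has been torn down; otherwise the iw_cm workq deadlocks processing
	 * the CLOSE event for this connection.  Remove this ordering
	 * constraint once the IWCM no longer blocks on other cm_ids.
	 */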