[PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
Steve Wise
swise at opengridcomputing.com
Mon Jul 18 08:47:36 PDT 2016
> > > Hey Sagi, here is some lite reading for you. :)
> > >
> > > Prelude: As part of disconnecting an iwarp connection, the iwarp provider
> > > needs to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto
> > > the single-threaded workq for iw_cm.
> > >
> > > Here is what happens with Sagi's patch:
> > >
> > > nvme_rdma_device_unplug() calls nvme_rdma_stop_queue(), which calls
> > > rdma_disconnect(). This triggers the disconnect. iw_cxgb4 posts the
> > > IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in
> > > the iw_cm workq thread context. cm_close_handler() calls the rdma_cm
> > > event handler for this cm_id, function cm_iw_handler(), which blocks
> > > until any currently running event handler for this cm_id finishes. It
> > > does this by calling cm_disable_callback(). However, since this whole
> > > unplug process is running in the event handler function for this same
> > > cm_id, the iw_cm workq thread is now stuck in a deadlock.
> > > nvme_rdma_device_unplug(), however, continues on, schedules the
> > > controller delete worker thread, and waits for it to complete. The
> > > delete controller worker thread tries to disconnect and destroy all the
> > > remaining IO queues, but gets stuck in the destroy() path on the first
> > > IO queue because the iw_cm workq thread is already stuck, and processing
> > > the CLOSE event is required to release a reference the iw_cm has on the
> > > iwarp provider's QP. So everything comes to a grinding halt...
> > >
> > > Now: Ming's 2 patches avoid this deadlock because the cm_id that
> > > received the device removal event is disconnected/destroyed _only
> > > after_ all the controller queues are disconnected/destroyed. So
> > > nvme_rdma_device_unplug() doesn't get stuck waiting for the controller
> > > to delete the IO queues, and only after that completes does it delete
> > > the cm_id/QP that got the device removal event. It then returns, causing
> > > the rdma_cm to release the cm_id's callback mutex. This lets the iw_cm
> > > workq thread unblock and continue on. (Can you say house of cards?)
> > >
> > > So the net is: the cm_id that received the device removal event _must_
> > > be disconnected/destroyed _last_.
> >
> > Hey Steve, thanks for the detailed description.
> >
> > IMHO, this really looks like a buggy design in the iWARP connection
> > management implementation. The fact that rdma_disconnect's forward
> > progress is dependent on another cm_id's execution looks wrong and
> > backwards to me.
> >
>
> Yea. I'm not sure how to fix it at this point...
>
> > Given that the device is being removed altogether, I don't think that
> > calling rdma_disconnect is important at all. Given that ib_drain_qp
> > makes sure the queue pair is in the error state, does this incremental
> > patch resolve the iwarp deadlock you are seeing?
> >
>
> I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR explicitly.
> When I implemented them I assumed the QP would be disconnected. Let me try
> this and report back.

Just doing the drain has the same issue, because moving the QP to ERROR
triggers the same provider/IWCM interactions as a disconnect does.
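
To make that concrete, the change I was describing for c4iw_drain_sq/rq()
would look roughly like the sketch below (not the actual driver code;
example_drain_sq is a made-up name). The problem is that the explicit
transition to ERROR kicks off the same provider -> IWCM close processing on
the single-threaded iw_cm workq that rdma_disconnect() does, so it hits the
same deadlock:

#include <rdma/ib_verbs.h>

/*
 * Sketch only: force the QP into ERROR before waiting for the queues to
 * drain, instead of assuming a prior disconnect already put it there.
 */
static void example_drain_sq(struct ib_qp *qp)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };

	/* Moving the QP to ERROR flushes the outstanding WRs... */
	ib_modify_qp(qp, &attr, IB_QP_STATE);

	/*
	 * ...but in an iwarp provider this transition also tears down the
	 * TCP connection state, which posts the same CLOSE processing to
	 * the single-threaded iw_cm workq that rdma_disconnect() would.
	 * The caller would then post a drain WR and wait for its
	 * completion, as ib_drain_qp()/__ib_drain_sq() do in the core.
	 */
}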

The reason this close provider/IWCM interaction is needed is the iWARP
relationship between QPs and the TCP connection (and thus the cm_id). The
iwarp driver stack manipulates the QP state based on the connection events
directly. For example, the transition from IDLE -> RTS is done _only_ by the
iwarp driver stack. Similarly, if the peer closes the TCP connection (or
aborts), the local iwarp driver stack moves the QP out of RTS into CLOSING
or ERROR. I think this is different from IB and the IBCM. This, combined
with the IWCM maintaining a reference on the QP until the iwarp driver posts
a CLOSE event for its iw_cm_id, is what causes the problem here. I'm not
justifying the existing logic, just trying to explain it.
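
To spell out that dependency, here is a toy model of it (not real iw_cm or
iw_cxgb4 code; toy_qp and friends are made up). The provider's QP destroy
has to wait for the IWCM to drop its reference, and that reference is only
dropped when the CLOSE event is processed on the single-threaded iw_cm
workq, which is the same thread that ends up blocked behind the
device-removal cm_id's callback:

#include <linux/atomic.h>
#include <linux/completion.h>

/* Toy model of the QP reference the IWCM holds. */
struct toy_qp {
	atomic_t refcount;		/* one ref owned by the IWCM */
	struct completion released;	/* completed when the last ref drops */
};

/* Runs from the iw_cm workq when the CLOSE event is processed. */
static void toy_close_event(struct toy_qp *qp)
{
	if (atomic_dec_and_test(&qp->refcount))
		complete(&qp->released);
}

/* The provider's QP destroy path. */
static void toy_destroy_qp(struct toy_qp *qp)
{
	/*
	 * Blocks until toy_close_event() runs. If the iw_cm workq is
	 * already stuck behind a blocked cm_id callback, this never
	 * completes, which is the deadlock described above.
	 */
	wait_for_completion(&qp->released);
}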

Perhaps we can explore changing the iw_cm and the iWARP parts of the rdma_cm
to fix this problem... I guess the main problem is that QP disconnect and
destroy basically cannot be done in the event handler. Granted, it works if
you only destroy the QP associated with the cm_id handling the event, but
that still deadlocks the iw_cm workq thread; it's just that the event
handler can eventually return and thereby unblock it.
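
For completeness, the ordering that does work from inside the handler (and,
as far as I can tell, what the proposed patches rely on) is to tear down
every other queue first and leave the event's own cm_id for last, letting
the rdma_cm core destroy it after the handler returns. A sketch with made-up
names (device_removal_sketch, delete_all_other_queues, struct my_ctrl); the
real teardown details obviously differ:

#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>

struct my_ctrl;						/* stand-in for the controller */
void delete_all_other_queues(struct my_ctrl *ctrl);	/* made-up helper */

static int device_removal_sketch(struct rdma_cm_id *cm_id,
				 struct my_ctrl *ctrl)
{
	/*
	 * 1) Tear down every *other* queue first. Their CLOSE events can
	 *    still be processed because the iw_cm workq is not yet blocked
	 *    behind this cm_id's callback mutex.
	 */
	delete_all_other_queues(ctrl);

	/*
	 * 2) Only now touch the cm_id that delivered the event. Its CLOSE
	 *    handling will block on our callback mutex, but we are about
	 *    to return and release it.
	 */
	rdma_disconnect(cm_id);
	ib_drain_qp(cm_id->qp);
	rdma_destroy_qp(cm_id);

	/*
	 * 3) rdma_destroy_id() must not be called from the handler;
	 *    returning non-zero asks the rdma_cm core to destroy this
	 *    cm_id once the handler returns, i.e. strictly last.
	 */
	return 1;
}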