target crash / host hang with nvme-all.3 branch of nvme-fabrics

Ming Lin mlin at kernel.org
Tue Jun 28 14:04:11 PDT 2016


On Tue, 2016-06-28 at 14:43 -0500, Steve Wise wrote:
> > I'm using a ram disk for the target.  Perhaps before
> > I was using a real nvme device.  I'll try that too and see if I still hit this
> > deadlock/stall...
> > 
> 
> Hey Ming,
> 
> It seems using a real nvme device at the target, vs a ram device, avoids this
> new deadlock issue.  And I'm running so far w/o the usual touch-after-free
> crash; usually I hit it quickly.  It looks like your patch did indeed fix
> that.  So:
> 
> 1) We need to address Christoph's concern that your fix isn't the ideal/correct
> solution.  How do you want to proceed on that angle?  How can I help?

This one should be more correct: the rsp was leaked when queue->state is
NVMET_RDMA_Q_DISCONNECTING, so we should put it back.

It works for me. Could you help verify?

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 425b55c..ee8b85e 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -727,6 +727,8 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		spin_lock_irqsave(&queue->state_lock, flags);
 		if (queue->state == NVMET_RDMA_Q_CONNECTING)
 			list_add_tail(&rsp->wait_list, &queue->rsp_wait_list);
+		else
+			nvmet_rdma_put_rsp(rsp);
 		spin_unlock_irqrestore(&queue->state_lock, flags);
 		return;
 	}

> 
> 2) the deadlock below is probably some other issue.  Looks more like a cxgb4
> problem at first glance.  I'll look into this one...
> 
> Steve.
> 
> 