Disconnecting nvmet-rdma

Thu Oct 20 00:49:43 PDT 2016

> how do you reproduce the timedwait condition?  My RDMA test setup is
> still being moved, so I can't reproduce it myself, but I'd like to know
> for the future.

+1 on that, I was never able to hit this.

> The only reason why I could see a NULL queue here is if RDMA/CM
> also calls the timedwait exit handler for the listener CM ids,
> in which case your patch would be correct.  Can you check for that
> theory by printing the cm_id address in nvmet_rdma_add_port and in
> nvmet_rdma_cm_handler?

Where do you see an indication for that in the code? TIMEWAIT doesn't make
sense for listener cm_ids. The CM enters timewait (starts a timer) when it:
- sends a disconnect request
- sends a disconnect reply
- sends a connect reject
- receives a connect reject

None of those happen with the listener cm_id. Maybe I'm missing
something?
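
To spell out the list above, here is a rough userspace sketch. The enum
and function names are made up for illustration; this is not the
kernel's CM state machine, only a toy model of the four cases I listed
plus a listener that does none of them:

```c
#include <stdbool.h>

/*
 * Toy model only. These names are hypothetical and do not exist in
 * the kernel; they mirror the four cases listed in this mail.
 */
enum toy_cm_action {
	TOY_CM_SEND_DREQ,	/* sends a disconnect request */
	TOY_CM_SEND_DREP,	/* sends a disconnect reply */
	TOY_CM_SEND_REJ,	/* sends a connect reject */
	TOY_CM_RECV_REJ,	/* receives a connect reject */
	TOY_CM_LISTEN,		/* listener waiting for connect requests */
};

/* Return true if the action is one of the four that start the timer. */
static bool toy_cm_enters_timewait(enum toy_cm_action action)
{
	switch (action) {
	case TOY_CM_SEND_DREQ:
	case TOY_CM_SEND_DREP:
	case TOY_CM_SEND_REJ:
	case TOY_CM_RECV_REJ:
		return true;
	default:
		return false;
	}
}
```

In this model toy_cm_enters_timewait(TOY_CM_LISTEN) is false, which is
why I would not expect a timewait exit event on the listener cm_id.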

Note that the cm_id->qp is NULL which means that it was
never created (destroy doesn't nullify it).
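
To illustrate why a NULL cm_id->qp ends up as a NULL queue in the
handler, here is a minimal userspace mock. All the mock_* types and
functions are hypothetical stand-ins I made up for this mail, not the
real rdma_cm or nvmet-rdma definitions, and the guard shown is just one
possible shape of a defensive check:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real structures. */
struct mock_queue {
	int idx;
};

struct mock_qp {
	void *qp_context;	/* points at the owning queue */
};

struct mock_cm_id {
	struct mock_qp *qp;	/* NULL if the QP was never created */
};

/* Resolve the queue through the QP; NULL if there is no QP. */
static struct mock_queue *mock_cm_id_to_queue(const struct mock_cm_id *cm_id)
{
	if (!cm_id->qp)
		return NULL;
	return cm_id->qp->qp_context;
}

/*
 * Handle a timewait exit: return the queue index that would be
 * disconnected, or -1 if there is no queue to act on.
 */
static int mock_handle_timewait_exit(const struct mock_cm_id *cm_id)
{
	struct mock_queue *queue = mock_cm_id_to_queue(cm_id);

	if (!queue)
		return -1;	/* would be a NULL dereference without this check */
	return queue->idx;
}
```

A cm_id whose QP was never created takes the -1 path here, while one
with a QP resolves to its queue as usual.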

From looking at the code there are two flows that can trigger this:
- we failed nvmet_rdma_queue_connect() but didn't destroy
   the cm_id -> which triggered this event later (but I can't find
   any indication of that in the code)

- Something in the CM spaghetti triggered this after we accepted
   but the client rejected us for some reason (although I think we should
   have seen an UNREACHABLE event then)

Or something else...

Bart, more information on what happened exactly (and how) would help
here.

> Also is there any chance you could try your reproducer with the iSER target
> as well?  It also seems to blindly dereference the queue.

It probably will happen too...