nvmet_rdma crash - DISCONNECT event with NULL queue
swise at opengridcomputing.com
Tue Nov 1 09:49:03 PDT 2016
> >> So I think that the patch from Bart a few weeks ago was correct:
> > Not quite. It just guards against a null queue for TIMEWAIT_EXIT, which is
> > generated by the IB_CM.
> Yes, this is why we need ADDR_CHANGE and DISCONNECTED too
> "(and include all the relevant cases around it)"
> The other events we don't get to LIVE state and we don't have
> other error flows that will trigger queue teardown sequence.
> nvmet-rdma: Fix possible NULL deref when handling rdma cm events
> When we initiate queue teardown sequence we call rdma_destroy_qp
> which clears cm_id->qp, afterwards we call rdma_destroy_id, but
> we might see a rdma_cm event in between with a cleared cm_id->qp
> so watch out for that and silently ignore the event because this
> means that the queue teardown sequence is in progress.
> Signed-off-by: Bart Van Assche <bart.vanassche at sandisk.com>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> drivers/nvme/target/rdma.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index b4d648536c3e..240888efd920 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1351,7 +1351,13 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
>  	case RDMA_CM_EVENT_ADDR_CHANGE:
>  	case RDMA_CM_EVENT_DISCONNECTED:
>  	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> -		nvmet_rdma_queue_disconnect(queue);
> +		/*
> +		 * We might end up here when we already freed the qp
> +		 * which means queue release sequence is in progress,
> +		 * so don't get in the way...
> +		 */
> +		if (queue)
> +			nvmet_rdma_queue_disconnect(queue);
>  	case RDMA_CM_EVENT_DEVICE_REMOVAL:
>  		ret = nvmet_rdma_device_removal(cm_id, queue);
This looks good. But you mentioned the 2 rapid-fire keep alive timeout logs for
the same controller as being seriously broken. Perhaps that is another problem?
Maybe keep alives aren't getting stopped correctly or something...
But: I'll try this patch and run for a few hours and see what happens. I
believe regardless of a keep alive issue, the above patch is still needed.
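
To check that I'm reading the race correctly, here is a minimal userspace
sketch of the guarded handler. It is not the driver code: every mock_* and
MOCK_* name is a hypothetical stand-in, and I'm assuming the handler derives
the queue from cm_id->qp, so an event that arrives after the qp is destroyed
sees a NULL queue and gets ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum mock_cm_event {
	MOCK_EVENT_ADDR_CHANGE,
	MOCK_EVENT_DISCONNECTED,
	MOCK_EVENT_TIMEWAIT_EXIT,
	MOCK_EVENT_OTHER,
};

struct mock_queue {
	int disconnects;		/* number of times disconnect ran */
};

struct mock_qp {
	struct mock_queue *qp_context;	/* back-pointer to the owning queue */
};

struct mock_cm_id {
	struct mock_qp *qp;		/* cleared once the qp is destroyed */
};

/* Models nvmet_rdma_queue_disconnect(): it dereferences the queue. */
static void mock_queue_disconnect(struct mock_queue *queue)
{
	assert(queue != NULL);		/* a NULL here is the crash in question */
	queue->disconnects++;
}

/* Models rdma_destroy_qp() clearing cm_id->qp during queue teardown. */
static void mock_destroy_qp(struct mock_cm_id *cm_id)
{
	cm_id->qp = NULL;
}

/* Returns 1 if the event was acted on, 0 if it was silently ignored. */
static int mock_cm_handler(struct mock_cm_id *cm_id, enum mock_cm_event event)
{
	struct mock_queue *queue = NULL;

	/* Assumption: the queue is only reachable through cm_id->qp. */
	if (cm_id->qp)
		queue = cm_id->qp->qp_context;

	switch (event) {
	case MOCK_EVENT_ADDR_CHANGE:
	case MOCK_EVENT_DISCONNECTED:
	case MOCK_EVENT_TIMEWAIT_EXIT:
		/* Teardown already in progress: don't get in the way. */
		if (!queue)
			return 0;
		mock_queue_disconnect(queue);
		return 1;
	default:
		return 0;
	}
}
```

Calling mock_cm_handler() with MOCK_EVENT_DISCONNECTED on a live cm_id runs
the disconnect once. Calling it again after mock_destroy_qp() returns 0
without touching the queue, which is the window the patch closes.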