nvmet_rdma crash - DISCONNECT event with NULL queue

Tue Nov 1 09:49:03 PDT 2016

> >> So I think that the patch from Bart a few weeks ago was correct:
> >>
> >
> > Not quite.  It just guards against a null queue for TIMEWAIT_EXIT, which is
> > only generated by the IB_CM.
> 
> Yes, this is why we need ADDR_CHANGE and DISCONNECTED too
> "(and include all the relevant cases around it)"
> 
> For the other events we never reach LIVE state, and we have no
> other error flows that would trigger the queue teardown sequence.
> 
> --
> nvmet-rdma: Fix possible NULL deref when handling rdma cm
>   events
> 
> When we initiate queue teardown sequence we call rdma_destroy_qp
> which clears cm_id->qp, afterwards we call rdma_destroy_id, but
> we might see a rdma_cm event in between with a cleared cm_id->qp
> so watch out for that and silently ignore the event because this
> means that the queue teardown sequence is in progress.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche at sandisk.com>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>   drivers/nvme/target/rdma.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index b4d648536c3e..240888efd920 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1351,7 +1351,13 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
>          case RDMA_CM_EVENT_ADDR_CHANGE:
>          case RDMA_CM_EVENT_DISCONNECTED:
>          case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> -               nvmet_rdma_queue_disconnect(queue);
> +               /*
> +                * We might end up here when we already freed the qp
> +                * which means queue release sequence is in progress,
> +                * so don't get in the way...
> +                */
> +               if (queue)
> +                       nvmet_rdma_queue_disconnect(queue);
>                  break;
>          case RDMA_CM_EVENT_DEVICE_REMOVAL:
>                  ret = nvmet_rdma_device_removal(cm_id, queue);
> --

This looks good.  But you mentioned the two rapid-fire keep alive timeout logs
for the same controller as being seriously broken. Perhaps that is a separate
problem? Maybe keep alives aren't getting stopped correctly or something...

But:  I'll try this patch and run for a few hours and see what happens.  I
believe regardless of a keep alive issue, the above patch is still needed.
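
For reference, here is a rough userspace sketch of the race the commit message
describes. It is not the kernel code: every sim_* name is a hypothetical
stand-in, and it assumes the handler derives the queue from cm_id->qp and so
sees a NULL queue once the qp has been destroyed. Calling sim_race_demo() from
a small main() delivers one event on a live queue and one after the qp is gone.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, not the real definitions. */
struct sim_queue { int disconnects; };
struct sim_qp    { struct sim_queue *qp_context; };
struct sim_cm_id { struct sim_qp *qp; };

enum sim_event { SIM_ADDR_CHANGE, SIM_DISCONNECTED, SIM_TIMEWAIT_EXIT };

/* Stand-in for nvmet_rdma_queue_disconnect(); dereferences queue. */
static void sim_queue_disconnect(struct sim_queue *queue)
{
	queue->disconnects++;
}

/* Stand-in for rdma_destroy_qp(): teardown clears cm_id->qp. */
static void sim_destroy_qp(struct sim_cm_id *cm_id)
{
	cm_id->qp = NULL;
}

/* Assumed shape of the handler: queue is only known while cm_id->qp is set. */
static int sim_cm_handler(struct sim_cm_id *cm_id, enum sim_event event)
{
	struct sim_queue *queue = NULL;

	if (cm_id->qp)
		queue = cm_id->qp->qp_context;

	switch (event) {
	case SIM_ADDR_CHANGE:
	case SIM_DISCONNECTED:
	case SIM_TIMEWAIT_EXIT:
		/* NULL queue means teardown is already in progress: ignore. */
		if (queue)
			sim_queue_disconnect(queue);
		break;
	}
	return 0;
}

/* Returns how many times the queue was disconnected across both events. */
static int sim_race_demo(void)
{
	struct sim_queue queue = { 0 };
	struct sim_qp qp = { &queue };
	struct sim_cm_id cm_id = { &qp };

	sim_cm_handler(&cm_id, SIM_DISCONNECTED);  /* live queue: handled */
	sim_destroy_qp(&cm_id);                    /* teardown has started */
	sim_cm_handler(&cm_id, SIM_TIMEWAIT_EXIT); /* late event: ignored */

	return queue.disconnects;
}
```

Without the NULL check, the second event in this sketch would dereference a
NULL queue, which is the same kind of crash as in the subject line.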

Steve.