nvmet_rdma crash - DISCONNECT event with NULL queue
sagi at grimberg.me
Tue Nov 1 09:15:55 PDT 2016
> Hey guys,
> I just hit an nvmf target NULL pointer deref BUG after a few hours of keep-alive
> timeout testing. It appears that nvmet_rdma_cm_handler() was called with
> cm_id->qp == NULL, so the local nvmet_rdma_queue * variable queue is left as
> NULL. But then nvmet_rdma_queue_disconnect() is called with queue == NULL which
> causes the crash.
AFAICT, the only way cm_id->qp can be NULL is a scenario where we never
got to allocate a queue-pair (i.e. never called rdma_create_qp). The
teardown paths do not nullify cm_id->qp...
Are you sure that the flow is indeed a DISCONNECTED event?
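
To make the failure mode concrete, here is a minimal userspace sketch of
the pattern under discussion: the handler derives the queue from
cm_id->qp, so a NULL qp leaves queue NULL, and the disconnect path must
not be entered in that case. All of the mock_* types and functions are
hypothetical stand-ins, not the real kernel definitions, and the guard
only illustrates the idea; it is not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct mock_queue { int idx; int disconnected; };
struct mock_qp { void *qp_context; };
struct mock_cm_id { struct mock_qp *qp; };

enum mock_event { MOCK_EVENT_ESTABLISHED, MOCK_EVENT_DISCONNECTED };

/* Dereferences queue, so callers must never pass NULL. */
static void mock_queue_disconnect(struct mock_queue *queue)
{
	assert(queue != NULL);
	queue->disconnected = 1;
}

/*
 * Returns 0 if the event was handled, -1 if a DISCONNECTED event
 * arrived on a cm_id that has no queue attached.
 */
static int mock_cm_handler(struct mock_cm_id *cm_id, enum mock_event event)
{
	struct mock_queue *queue = NULL;

	/* queue stays NULL when no queue-pair was ever created. */
	if (cm_id->qp)
		queue = cm_id->qp->qp_context;

	switch (event) {
	case MOCK_EVENT_DISCONNECTED:
		if (!queue)
			return -1;	/* assumed guard: nothing to disconnect */
		mock_queue_disconnect(queue);
		return 0;
	default:
		return 0;
	}
}
```

With a cm_id whose qp is NULL, the DISCONNECTED case returns -1 instead
of calling into the disconnect routine. Whether the real handler should
do that or should never see such an event is exactly the open question.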
> In the log, I see that the target side keep-alive fired:
> [20676.867545] eth2: link up, 40Gbps, full-duplex, Tx/Rx PAUSE
> [20677.079669] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [20677.079684] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
Wow, two keep-alive timeouts on the same controller? that is
> [20677.088402] nvmet_rdma: freeing queue 276
> [20677.090981] BUG: unable to handle kernel NULL pointer dereference at
> [20677.090988] IP: [<ffffffffa084b6b4>] nvmet_rdma_queue_disconnect+0x24/0x90
No stack trace?
> So maybe there is just a race in that keep-alive can free the queue and yet a
> DISCONNECTED event still received on the cm_id after the queue is freed?
rdma_destroy_id() should act as a barrier against this scenario.