nvmf/rdma host crash during heavy load and keep alive recovery

Wed Aug 10 11:59:34 PDT 2016

> The nvme_rdma_ctrl queue associated with the request is in RECONNECTING state:
> 
>   ctrl = {
>     state = NVME_CTRL_RECONNECTING,
>     lock = {
> 
> So it should not be posting SQ WRs...

The keep-alive timeout (KATO) kicks off error recovery,
nvme_rdma_error_recovery_work(), which calls nvme_cancel_request() on each
request.  nvme_cancel_request() sets req->errors to NVME_SC_ABORT_REQ.  It then
completes the request, which ends up at nvme_rdma_complete_rq(), which queues
it for retry:
...
        if (unlikely(rq->errors)) {
                if (nvme_req_needs_retry(rq, rq->errors)) {
                        nvme_requeue_req(rq);
                        return;
                }

                if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
                        error = rq->errors;
                else
                        error = nvme_error_status(rq->errors);
        }
...

The retry will end up at nvme_rdma_queue_rq(), which will try to post a send WR
to a freed QP...
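
To picture the check that seems to be missing on that path, here is a rough
user-space sketch.  It is not a patch and not the driver's code: every model_*
name is made up for illustration, and the real nvme_rdma_queue_rq() may need a
different test or a different return value.  The only point is that a requeued
request gets bounced instead of posted while the controller is not live:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the controller state and the RDMA queue. */
enum model_ctrl_state { MODEL_CTRL_LIVE, MODEL_CTRL_RECONNECTING };

struct model_queue {
	enum model_ctrl_state ctrl_state;
	bool qp_valid;	/* false once the QP has been freed */
	int posted;	/* number of send WRs "posted" so far */
};

enum model_rc { MODEL_OK, MODEL_BUSY };

/* Model of a queue_rq handler that refuses to touch a torn-down queue. */
static enum model_rc model_queue_rq(struct model_queue *q)
{
	/* Bounce the request while recovery is in progress or the QP is gone. */
	if (q->ctrl_state != MODEL_CTRL_LIVE || !q->qp_valid)
		return MODEL_BUSY;

	q->posted++;	/* stands in for posting the send WR */
	return MODEL_OK;
}
```

With ctrl_state set to MODEL_CTRL_RECONNECTING, model_queue_rq() returns
MODEL_BUSY and the posted count stays unchanged.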

Should the canceled requests actually OR in the NVME_SC_DNR bit?  Currently
that is only done in nvme_cancel_request() if the blk queue is dying:

...
        status = NVME_SC_ABORT_REQ;
        if (blk_queue_dying(req->q))
                status |= NVME_SC_DNR;
...
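
To make the question concrete, here is a small user-space model of how the DNR
bit would change the retry decision.  It is only a sketch under assumptions:
the model_* names and the always_dnr knob are hypothetical, the constants are
the values as I read them in include/linux/nvme.h, and the real
nvme_req_needs_retry() may apply further limits that are ignored here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check include/linux/nvme.h before relying on them. */
#define MODEL_SC_ABORT_REQ	0x7
#define MODEL_SC_DNR		0x4000

/*
 * Hypothetical stand-in for the status chosen in nvme_cancel_request().
 * always_dnr models the change being asked about: set DNR on every
 * canceled request, not only when the blk queue is dying.
 */
static uint16_t model_cancel_status(bool queue_dying, bool always_dnr)
{
	uint16_t status = MODEL_SC_ABORT_REQ;

	if (queue_dying || always_dnr)
		status |= MODEL_SC_DNR;
	return status;
}

/* Simplified retry decision: only the DNR bit is modeled. */
static bool model_needs_retry(uint16_t status)
{
	return !(status & MODEL_SC_DNR);
}
```

In this model, model_cancel_status(false, false) yields a status that
model_needs_retry() accepts, which corresponds to the requeue seen above.
Setting either flag makes the request fail instead of being retried.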

Sagi, please put on your KATO hat and help! :)

Thanks,

Steve.