nvmf/rdma host crash during heavy load and keep alive recovery

Wed Aug 10 23:27:52 PDT 2016

On 10/08/16 21:59, Steve Wise wrote:
>
>> The nvme_rdma_ctrl queue associated with the request is in RECONNECTING state:
>>
>>   ctrl = {
>>     state = NVME_CTRL_RECONNECTING,
>>     lock = {
>>
>> So it should not be posting SQ WRs...
>
> kato kicks error recovery, nvme_rdma_error_recovery_work(), which calls
> nvme_cancel_request() on each request.  nvme_cancel_request() sets req->errors
> to NVME_SC_ABORT_REQ.  It then completes the request which ends up at
> nvme_rdma_complete_rq() which queues it for retry:
> ...
>         if (unlikely(rq->errors)) {
>                 if (nvme_req_needs_retry(rq, rq->errors)) {
>                         nvme_requeue_req(rq);
>                         return;
>                 }
>
>                 if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
>                         error = rq->errors;
>                 else
>                         error = nvme_error_status(rq->errors);
>         }
> ...
>
> The retry will end up at nvme_rdma_queue_rq() which will try and post a send wr
> to a freed qp...
>
> Should the canceled requests actually OR in bit NVME_SC_DNR?  That is only done
> in nvme_cancel_request() if the blk queue is dying:

The DNR bit should not normally be set; it is set only when we either
don't want to requeue the request or we can't.

>
> ...
>         status = NVME_SC_ABORT_REQ;
>         if (blk_queue_dying(req->q))
>                 status |= NVME_SC_DNR;
> ...
>

A couple of questions:

1. Does bringing down the interface generate a DEVICE_REMOVAL
event?

2. When the keep-alive timeout expires, nvme_rdma_timeout() kicks
error recovery and sets:
rq->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR

So I'm not at all convinced that the keep-alive is the request that is
being re-issued. Did you verify that?
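
To make the distinction concrete, here is a rough userspace sketch of how
the DNR bit decides between requeue and completion. It is only a model:
needs_retry() and cancel_status() are hypothetical helpers, needs_retry()
covers just the DNR part of nvme_req_needs_retry() (the real check also
looks at retry count and timeout), and the two status values are what I
believe include/linux/nvme.h defines, so verify them against your tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, taken from memory of include/linux/nvme.h. */
#define NVME_SC_ABORT_REQ	0x7
#define NVME_SC_DNR		0x4000

/*
 * Hypothetical, simplified stand-in for nvme_req_needs_retry():
 * models only the DNR test, not the retry-count or timeout checks.
 */
static bool needs_retry(uint16_t status)
{
	return !(status & NVME_SC_DNR);
}

/*
 * Hypothetical helper mirroring the status chosen in
 * nvme_cancel_request(): DNR is added only if the queue is dying.
 */
static uint16_t cancel_status(bool queue_dying)
{
	uint16_t status = NVME_SC_ABORT_REQ;

	if (queue_dying)
		status |= NVME_SC_DNR;
	return status;
}
```

With this model, needs_retry(cancel_status(false)) is true, so a request
cancelled on a live queue gets requeued, while a status of
NVME_SC_ABORT_REQ | NVME_SC_DNR (the keep-alive timeout case above) is
never requeued.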