nvmf/rdma host crash during heavy load and keep alive recovery
Steve Wise
swise at opengridcomputing.com
Wed Aug 10 11:59:34 PDT 2016
> The nvme_rdma_ctrl queue associated with the request is in RECONNECTING state:
>
> ctrl = {
> state = NVME_CTRL_RECONNECTING,
> lock = {
>
> So it should not be posting SQ WRs...
The keep-alive timeout (kato) kicks off error recovery,
nvme_rdma_error_recovery_work(), which calls nvme_cancel_request() on each
request. nvme_cancel_request() sets req->errors to NVME_SC_ABORT_REQ. It then
completes the request, which ends up in nvme_rdma_complete_rq(), and that
queues it for retry:
...
	if (unlikely(rq->errors)) {
		if (nvme_req_needs_retry(rq, rq->errors)) {
			nvme_requeue_req(rq);
			return;
		}
		if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
			error = rq->errors;
		else
			error = nvme_error_status(rq->errors);
	}
...
The retry will end up in nvme_rdma_queue_rq(), which will try to post a send WR
to a freed QP...
Should the canceled requests actually OR in the NVME_SC_DNR bit? Currently that
is only done in nvme_cancel_request() if the blk queue is dying:
...
	status = NVME_SC_ABORT_REQ;
	if (blk_queue_dying(req->q))
		status |= NVME_SC_DNR;
...
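To make the question concrete, here is a minimal userspace sketch of how the
DNR bit would change the retry decision. It is not the driver code: the
functions model_cancel_status() and model_needs_retry() are hypothetical
stand-ins, the retry check models only the DNR bit (the real
nvme_req_needs_retry() is assumed to look at more than that), and the numeric
status values are my assumption of what include/linux/nvme.h defines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values of the NVMe status codes from include/linux/nvme.h. */
#define NVME_SC_ABORT_REQ	0x7
#define NVME_SC_DNR		0x4000

/*
 * Hypothetical model of the status picked in nvme_cancel_request():
 * DNR is only ORed in when the block queue is dying.
 */
static uint16_t model_cancel_status(bool queue_dying)
{
	uint16_t status = NVME_SC_ABORT_REQ;

	if (queue_dying)
		status |= NVME_SC_DNR;
	return status;
}

/*
 * Hypothetical, simplified retry check: a request is retried unless
 * DNR is set. Other conditions are deliberately left out.
 */
static bool model_needs_retry(uint16_t status)
{
	return !(status & NVME_SC_DNR);
}
```

In this model, a request canceled while the queue is not dying comes back with
plain NVME_SC_ABORT_REQ, so model_needs_retry() returns true and the request
would be requeued. With DNR set it returns false and the request would be
failed instead of being sent back to the queue_rq path.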
Sagi, please put on your KATO hat and help! :)
Thanks,
Steve.