nvmf/rdma host crash during heavy load and keep alive recovery

Steve Wise swise at opengridcomputing.com
Thu Aug 11 06:58:38 PDT 2016


> >> The nvme_rdma_ctrl queue associated with the request is in RECONNECTING
> state:
> >>
> >>   ctrl = {
> >>     state = NVME_CTRL_RECONNECTING,
> >>     lock = {
> >>
> >> So it should not be posting SQ WRs...
> >
> > kato kicks error recovery, nvme_rdma_error_recovery_work(), which calls
> > nvme_cancel_request() on each request.  nvme_cancel_request() sets
> > req->errors to NVME_SC_ABORT_REQ.  It then completes the request which
> > ends up at nvme_rdma_complete_rq() which queues it for retry:
> > ...
> >         if (unlikely(rq->errors)) {
> >                 if (nvme_req_needs_retry(rq, rq->errors)) {
> >                         nvme_requeue_req(rq);
> >                         return;
> >                 }
> >
> >                 if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
> >                         error = rq->errors;
> >                 else
> >                         error = nvme_error_status(rq->errors);
> >         }
> > ...
> >
> > The retry will end up at nvme_rdma_queue_rq() which will try and post a
> > send wr to a freed qp...
> >
> > Should the canceled requests actually OR in bit NVME_SC_DNR?  That is
> > only done in nvme_cancel_request() if the blk queue is dying:
> 
> normally the DNR bit should not be set, only when we either don't want
> to requeue or we can't.
> 
> >
> > ...
> >         status = NVME_SC_ABORT_REQ;
> >         if (blk_queue_dying(req->q))
> >                 status |= NVME_SC_DNR;
> > ...
> >
> 
> couple of questions:
> 
> 1. bringing down the interface means generating a DEVICE_REMOVAL
> event?
> 

No.  Just ifconfig ethX down; sleep 10; ifconfig ethX up.  This simply causes
the pending work requests to take longer to complete and kicks in the kato
logic.
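
For reference, this is roughly the path I mean (a sketch from memory of the
4.8-rc nvme/rdma code, not a verbatim copy, so treat the function body as
approximate):

...
static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

        /* stalled send WRs keep the keep-alive from completing, so this
         * fires and schedules controller-wide error recovery */
        nvme_rdma_error_recovery(req->queue->ctrl);

        /* the timed-out request itself is failed with DNR set */
        rq->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR;

        return BLK_EH_HANDLED;
}
...

The error recovery work then cancels the outstanding requests, which is where
the NVME_SC_ABORT_REQ-without-DNR completions come from.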


> 2. keep-alive timeout expires means that nvme_rdma_timeout() kicks
> error_recovery and sets:
> rq->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR
> 
> So I'm not at all convinced that the keep-alive is the request that is
> being re-issued. Did you verify that?

The request that caused the crash had rq->errors == NVME_SC_ABORT_REQ.  I'm not
sure that is always the case though.  But this is very easy to reproduce, so I
should be able to drill down and add any debug code you think might help.
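
For example, something like this at the top of nvme_rdma_queue_rq() should
catch it (a hypothetical debug hunk; the field names are from the 4.8-rc code
as I remember them, so they may need adjusting):

...
        struct nvme_rdma_queue *queue = hctx->driver_data;
        struct request *rq = bd->rq;

        /* debug: flag any request posted while the controller isn't live */
        if (queue->ctrl->ctrl.state != NVME_CTRL_LIVE)
                pr_info("nvme_rdma: queue_rq in ctrl state %d, rq->errors 0x%x\n",
                        queue->ctrl->ctrl.state, rq->errors);
...

That would show the controller state and rq->errors for whatever command is
being posted to the freed QP during recovery.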
