nvmf/rdma host crash during heavy load and keep alive recovery

Wed Aug 17 11:57:34 PDT 2016

>> If that is the case, I think we need to have a closer look at
>> nvme_stop_queues...
>>
>
> request_queue->queue_flags does have QUEUE_FLAG_STOPPED set:
>
> #define QUEUE_FLAG_STOPPED      2       /* queue is stopped */
>
> crash> request_queue.queue_flags -x 0xffff880397a13d28
>   queue_flags = 0x1f07a04
> crash> request_queue.mq_ops 0xffff880397a13d28
>   mq_ops = 0xffffffffa084b140 <nvme_rdma_mq_ops>
>
> So it appears the queue is stopped, yet a request is being processed for that
> queue.  Perhaps there is a race where QUEUE_FLAG_STOPPED is set after a request
> is scheduled?
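
For what it's worth, the flag arithmetic above checks out. Here is a
minimal user-space sketch of the bit test, assuming the flag value is a
bit number as the #define comment suggests. The helper name
queue_flag_is_set is made up for illustration and is not a kernel API;
the kernel has its own helpers for testing queue flags.

```c
#include <stdio.h>

/* Bit number taken from the #define quoted above. */
#define QUEUE_FLAG_STOPPED 2

/*
 * Hypothetical helper (not a kernel function): returns 1 if bit number
 * `bit` is set in `flags`, 0 otherwise.
 */
static int queue_flag_is_set(unsigned long flags, unsigned int bit)
{
	return (flags & (1UL << bit)) != 0;
}

/* Print whether the stopped bit is set in a queue_flags value. */
static void report_stopped(unsigned long flags)
{
	printf("queue_flags=0x%lx stopped=%d\n", flags,
	       queue_flag_is_set(flags, QUEUE_FLAG_STOPPED));
}
```

Calling report_stopped(0x1f07a04UL) with the value from the crash dump
reports stopped=1, since the low byte is 0x04 and 1 << 2 is 0x04.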

Umm. When the keep-alive timeout triggers we stop the queues. Only 10
seconds (or reconnect_delay) later do we free the queues and reestablish
them, so I find it hard to believe that a request was queued and then
spent that long in queue_rq before we freed the queue-pair.

From your description of the sequence it seems that after 10 seconds we
attempt a reconnect, and during that time an IO request crashes the
party.

I assume this means you ran traffic during the sequence, yes?