nvmf/rdma host crash during heavy load and keep alive recovery

Wed Sep 7 14:33:40 PDT 2016

> Now, given that you already verified that the queues are stopped with
> BLK_MQ_S_STOPPED, I'm looking at blk-mq now.
> 
> I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() indeed take
> BLK_MQ_S_STOPPED into account. Theoretically, if we free the queue
> pairs after we have passed these checks, while the rq_list is being
> processed, then we can end up with this condition. But given that it
> takes essentially forever (10 seconds), I tend to doubt this is the case.
> 
> HCH, Jens, Keith, any useful pointers for us?
> 
> To summarize, we see a stray request being queued long after we set
> BLK_MQ_S_STOPPED (and by long I mean 10 seconds).

Does nvme-rdma need to call blk_mq_queue_reinit() after it reinits the tag set
for that queue as part of reconnecting?
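
For reference, the theoretical window described in the quoted text (the
stopped check passes, then the queue pair goes away while the rq_list is
still being processed) can be modeled in plain userspace C. This is only
a sketch of the ordering, not kernel code: struct model_hctx, its fields,
and the model_* functions are hypothetical names made up for illustration,
and the "between" callback stands in for a concurrent teardown. It does
not explain the 10 second delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in; this is NOT the kernel's blk_mq_hw_ctx. */
struct model_hctx {
	bool stopped;      /* models BLK_MQ_S_STOPPED being set */
	bool qp_valid;     /* models whether the RDMA queue pair still exists */
	size_t nr_pending; /* models the number of requests on the rq_list */
};

enum dispatch_result {
	DISPATCH_SKIPPED,  /* queue was stopped, nothing dispatched */
	DISPATCH_OK,       /* all pending requests dispatched to a live QP */
	DISPATCH_STALE_QP  /* a request reached a QP that was already freed */
};

/* Models teardown: stop the queue and free the queue pair. */
static void model_teardown(struct model_hctx *h)
{
	h->stopped = true;
	h->qp_valid = false;
}

/*
 * The stopped flag is tested once, up front, and the pending list is then
 * drained without rechecking it. The optional "between" callback runs after
 * the check and before the drain, simulating a teardown racing with dispatch.
 */
static enum dispatch_result
model_run_hw_queue(struct model_hctx *h, void (*between)(struct model_hctx *))
{
	assert(h != NULL);

	if (h->stopped)
		return DISPATCH_SKIPPED;

	if (between)
		between(h);

	while (h->nr_pending > 0) {
		h->nr_pending--;
		if (!h->qp_valid)
			return DISPATCH_STALE_QP;
	}
	return DISPATCH_OK;
}
```

Calling model_run_hw_queue(&h, model_teardown) on a running queue with
pending requests returns DISPATCH_STALE_QP, while a queue that is already
stopped on entry returns DISPATCH_SKIPPED. The model only hits the stale
case when the teardown lands inside the dispatch pass, which is why a
request showing up 10 seconds later points somewhere else.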