nvmf/rdma host crash during heavy load and keep alive recovery

Thu Sep 8 01:22:18 PDT 2016

>> Now, given that you already verified that the queues are stopped with
>> BLK_MQ_S_STOPPED, I'm looking at blk-mq now.
>>
>> I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() indeed take
>> BLK_MQ_S_STOPPED into account. Theoretically, if we free the queue
>> pairs after we passed these checks while the rq_list is being processed,
>> then we can end up with this condition, but given that it takes
>> essentially forever (10 seconds) I tend to doubt this is the case.
>>
>> HCH, Jens, Keith, any useful pointers for us?
>>
>> To summarize, we see a stray request being queued long after we set
>> BLK_MQ_S_STOPPED (and by long I mean 10 seconds).
>
> Does nvme-rdma need to call blk_mq_queue_reinit() after it reinits the tag set
> for that queue as part of reconnecting?

I don't see how that'd help...
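
For reference, the theoretical window mentioned in the quoted text (the
BLK_MQ_S_STOPPED check passes, then the queue pairs are freed while the
rq_list is still being processed) can be sketched as a toy model. This is
only an illustration of the ordering, not kernel code: struct toy_hctx and
every toy_* function are hypothetical stand-ins, and the sketch says nothing
about whether such a window could really stay open for 10 seconds.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a hardware queue context; NOT the kernel's struct. */
struct toy_hctx {
    bool stopped;   /* models the BLK_MQ_S_STOPPED state bit */
    bool qp_alive;  /* models the RDMA queue pair still being allocated */
};

/* Models the check made when a hw queue is run: true means "dispatch". */
static bool toy_run_check(const struct toy_hctx *h)
{
    return !h->stopped;
}

/* Models error recovery: stop the queue, then free the queue pair. */
static void toy_teardown(struct toy_hctx *h)
{
    h->stopped = true;
    h->qp_alive = false;
}

/* Models the driver's queue_rq: 0 on success, -1 if the QP is already gone. */
static int toy_queue_rq(const struct toy_hctx *h)
{
    return h->qp_alive ? 0 : -1;
}

/* Racy ordering: check passes, teardown runs, then the request is queued. */
static int toy_replay_race(void)
{
    struct toy_hctx h = { .stopped = false, .qp_alive = true };
    bool go = toy_run_check(&h);      /* passes: queue not stopped yet */

    toy_teardown(&h);                 /* recovery runs inside the window */
    assert(h.stopped);                /* the flag is set, but too late */
    return go ? toy_queue_rq(&h) : 0; /* stale decision reaches a freed QP */
}

/* Safe ordering: teardown completes before the check is made. */
static int toy_replay_ordered(void)
{
    struct toy_hctx h = { .stopped = false, .qp_alive = true };

    toy_teardown(&h);
    return toy_run_check(&h) ? toy_queue_rq(&h) : 0; /* check fails, no dispatch */
}
```

In this model toy_replay_race() returns -1 (a request reaches a freed queue
pair even though the stopped flag is set), while toy_replay_ordered()
returns 0 because the check is made only after teardown has finished.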