nvmf/rdma host crash during heavy load and keep alive recovery
Sagi Grimberg
sagi at grimberg.me
Wed Aug 17 11:57:34 PDT 2016
>> If that is the case, I think we need to have a closer look at
>> nvme_stop_queues...
>>
>
> request_queue->queue_flags does have QUEUE_FLAG_STOPPED set:
>
> #define QUEUE_FLAG_STOPPED 2 /* queue is stopped */
>
> crash> request_queue.queue_flags -x 0xffff880397a13d28
> queue_flags = 0x1f07a04
> crash> request_queue.mq_ops 0xffff880397a13d28
> mq_ops = 0xffffffffa084b140 <nvme_rdma_mq_ops>
>
> So it appears the queue is stopped, yet a request is being processed for that
> queue. Perhaps there is a race where QUEUE_FLAG_STOPPED is set after a request
> is scheduled?

Umm. When the keep-alive timeout triggers we stop the queues. Only 10
seconds (or reconnect_delay) later do we free the queues and reestablish
them, so I find it hard to believe that a request was queued and then
spent that long in queue_rq before we freed the queue pair.

From your description of the sequence, it seems that after 10 seconds we
attempt a reconnect, and during that time an IO request crashes the
party.

I assume this means you ran traffic during the sequence, yes?