nvmf/rdma host crash during heavy load and keep alive recovery

Thu Sep 15 07:00:26 PDT 2016

"Steve Wise" <swise at opengridcomputing.com> writes:

> @@ -622,6 +625,7 @@ static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
>  {
>         if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
>                 return;
> +       BUG_ON(!test_bit(BLK_MQ_S_STOPPED, &queue->hctx->state));
>         nvme_rdma_stop_queue(queue);
>         nvme_rdma_free_queue(queue);
>  }
> @@ -1408,6 +1412,8 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>
>         WARN_ON_ONCE(rq->tag < 0);
>
> +       BUG_ON(hctx != queue->hctx);
> +       BUG_ON(test_bit(BLK_MQ_S_STOPPED, &hctx->state));
>         dev = queue->device->dev;
>         ib_dma_sync_single_for_cpu(dev, sqe->dma,
>                         sizeof(struct nvme_command), DMA_TO_DEVICE);
>

This reminds me of the discussion I had with Jens a few weeks ago here:

http://lists.infradead.org/pipermail/linux-nvme/2016-August/005916.html

The BUG_ON I hit is similar to yours, but for NVMe over PCI.  I think
the update queues code will reach a similar remapping path, but I
haven't checked yet.

Can you check that you are running with the patch he mentioned at:

http://lists.infradead.org/pipermail/linux-nvme/2016-August/005962.html

Thanks,

-- 
Gabriel Krisman Bertazi