[PATCH 0/2] Fix crash when rescan ns after set queue count cmd timeout

Tue Aug 3 02:06:28 PDT 2021

Hi,

We hit a BUG_ON when rescanning namespaces after the set queue count
command timed out:
--
BUG_ON(hctx_idx >= ctrl->ctrl.queue_count); //nvme_rdma_init_hctx
--
Call trace:
nvme_rdma_init_hctx+0x58/0x60 [nvme_rdma]
blk_mq_realloc_hw_ctxs+0x140/0x4c0
blk_mq_init_allocated_queue+0x130/0x410
blk_mq_init_queue+0x40/0x88
nvme_validate_ns+0xb8/0x740
nvme_scan_work+0x29c/0x460
process_one_work+0x1f8/0x490
worker_thread+0x50/0x4b8
kthread+0x134/0x138
ret_from_fork+0x10/0x18
--
This happened because:
1) During a reconnect, the host's set queue count feature command timed
out, so ctrl->queue_count was set to 1 and another reconnect was
scheduled.
2) The next reconnect succeeded but did not create any I/O queues:
because ctrl->queue_count was already 1, the host did not configure the
I/O queues again.
3) Deleting and re-adding a namespace on the controller made the host
rescan namespaces, and the kernel hit the BUG_ON when it detected
hctx_idx >= ctrl->queue_count.
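
The three steps above can be sketched as a small userspace model. This
is not driver code: struct model_ctrl, its nr_hw_queues field and the
model_* helpers are hypothetical names introduced only for illustration,
and the "queue_count > 1" gate is an assumption inferred from step 2.
Only the hctx_idx >= queue_count comparison is taken from the BUG_ON
quoted above.

```c
/* Userspace model only; none of this is nvme-rdma driver code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the controller state discussed above. */
struct model_ctrl {
	unsigned int queue_count;   /* admin queue + I/O queues */
	unsigned int nr_hw_queues;  /* hw contexts the block layer still expects */
};

/*
 * One reconnect attempt; returns true if the reconnect succeeds.
 * set_queues_times_out models the set queue count command timing out.
 */
static bool model_reconnect(struct model_ctrl *c, unsigned int wanted_io_queues,
			    bool set_queues_times_out)
{
	assert(c != NULL);

	/* Step 2 (assumed gate): I/O queues are set up only if queue_count > 1. */
	if (c->queue_count > 1) {
		if (set_queues_times_out) {
			/* Step 1: after the timeout only the admin queue is counted. */
			c->queue_count = 1;
			return false;   /* caller schedules another reconnect */
		}
		c->queue_count = wanted_io_queues + 1;
	}
	return true;
}

/* Same comparison as the BUG_ON quoted from nvme_rdma_init_hctx. */
static bool model_init_hctx_would_bug(const struct model_ctrl *c,
				      unsigned int hctx_idx)
{
	assert(c != NULL);
	return hctx_idx >= c->queue_count;
}

/* Step 3: a rescan initializes one hctx per expected hardware queue. */
static bool model_rescan_would_bug(const struct model_ctrl *c)
{
	assert(c != NULL);
	for (unsigned int i = 0; i < c->nr_hw_queues; i++)
		if (model_init_hctx_would_bug(c, i))
			return true;
	return false;
}
```

Starting from queue_count = 5 and nr_hw_queues = 4, a reconnect with the
timeout flag set leaves queue_count at 1, a second reconnect without the
timeout returns success but leaves queue_count unchanged, and
model_rescan_would_bug() then reports the condition at hctx_idx 1.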

The following patches try to fix it. Any comments and reviews are welcome.
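
Going only by the two patch subjects in this series, the intended guards
might look roughly like the userspace sketch below. It does not
reproduce the actual diffs: struct model_ctrl, the wanted_io_queues
field and both helper names are hypothetical, and the exact conditions
used in core.c and rdma.c are an assumption.

```c
/* Userspace model only; a reading of the patch subjects, not the diffs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; field names are illustrative. */
struct model_ctrl {
	unsigned int queue_count;       /* admin queue + configured I/O queues */
	unsigned int wanted_io_queues;  /* what the user asked for at connect */
};

/* Idea 1: decide from the user's request, not from a stale queue_count. */
static bool model_should_configure_io(const struct model_ctrl *c)
{
	assert(c != NULL);
	return c->wanted_io_queues > 0;
}

/* Idea 2: with only the admin queue present, skip the namespace scan. */
static bool model_should_scan(const struct model_ctrl *c)
{
	assert(c != NULL);
	return c->queue_count > 1;
}
```

With queue_count stuck at 1 and wanted_io_queues at 4, the first helper
still asks for I/O queue configuration on the next reconnect, and the
second keeps the scan from running until I/O queues exist again.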

Thanks,
Ruozhu

Ruozhu Li (2):
  nvme-rdma: always try to configure io queue when user wants it
  nvme: don't do scan work if io queue count is zero

 drivers/nvme/host/core.c | 6 ++++--
 drivers/nvme/host/rdma.c | 4 +++-
 2 files changed, 7 insertions(+), 3 deletions(-)

-- 
2.16.4