[bug report] nvme/rdma: nvme connect failed after offline one cpu on host side
Sagi Grimberg
sagi at grimberg.me
Mon Jul 4 16:04:53 PDT 2022
> Updated the subject to better describe the issue:
>
> So I tried to reproduce this on an nvme/rdma environment, and it was
> also reproducible; here are the steps:
>
> # echo 0 >/sys/devices/system/cpu/cpu0/online
> # dmesg | tail -10
> [ 781.577235] smpboot: CPU 0 is now offline
> # nvme connect -t rdma -a 172.31.45.202 -s 4420 -n testnqn
> Failed to write to /dev/nvme-fabrics: Invalid cross-device link
> no controller found: failed to write to nvme-fabrics device
>
> # dmesg
> [ 781.577235] smpboot: CPU 0 is now offline
> [ 799.471627] nvme nvme0: creating 39 I/O queues.
> [ 801.053782] nvme nvme0: mapped 39/0/0 default/read/poll queues.
> [ 801.064149] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
> [ 801.073059] nvme nvme0: failed to connect queue: 1 ret=-18
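For reference, ret=-18 is -EXDEV, which is exactly the "Invalid
cross-device link" that the write to /dev/nvme-fabrics reports back to
nvme-cli. A quick standalone check of the errno mapping (plain
userspace C, nothing nvme-specific):

/* prints the errno value and string behind the failure above */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	printf("EXDEV = %d: %s\n", EXDEV, strerror(EXDEV));
	return 0;
}

On Linux this prints "EXDEV = 18: Invalid cross-device link", matching
both the dmesg line and the nvme-cli error.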
This is because of blk_mq_alloc_request_hctx(), and it has been raised
before. IIRC there was reluctance to make it allocate a request for an
hctx even if its associated mapped CPU is offline.
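To make that concrete, here is a toy userspace model of the check
(simplified from memory, not the actual blk-mq source, and
alloc_request_for_hctx() is just an illustrative name): when no online
CPU is left in the hw queue's cpumask, the allocation bails out with
-EXDEV, which is the ret=-18 seen above.

/*
 * Toy model (plain userspace C, NOT kernel code) of the offline-CPU
 * check in blk_mq_alloc_request_hctx(): if no online CPU intersects
 * the hw queue's cpumask, the allocation fails with -EXDEV.
 */
#include <errno.h>
#include <stdio.h>

/* toy cpumasks: bit N set means CPU N is in the mask */
static int alloc_request_for_hctx(unsigned long hctx_cpumask,
				  unsigned long online_mask)
{
	if ((hctx_cpumask & online_mask) == 0)
		return -EXDEV;	/* no online CPU mapped to this hctx */
	return 0;		/* real code would allocate on a mapped CPU's ctx */
}

int main(void)
{
	unsigned long hctx_cpumask = 1UL << 0;	/* hctx mapped to cpu0 only */
	unsigned long online = ~(1UL << 0);	/* cpu0 has been offlined */

	/* prints "ret = -18", i.e. -EXDEV */
	printf("ret = %d\n", alloc_request_for_hctx(hctx_cpumask, online));
	return 0;
}

In the real code the mask is the hctx cpumask intersected with
cpu_online_mask, but the effect is the same: the Connect command for
that queue cannot be allocated, so the whole connect fails.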
The latest attempt was from Ming:
[PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
Don't know where that went, though...