[bug report] nvme/rdma: nvme connect failed after offline one cpu on host side
Sagi Grimberg
sagi at grimberg.me
Mon Jul 4 16:04:53 PDT 2022
> Updated the subject to better describe the issue:
>
> So I tried to reproduce this on an nvme/rdma environment, and it was
> also reproducible; here are the steps:
>
> # echo 0 >/sys/devices/system/cpu/cpu0/online
> # dmesg | tail -10
> [ 781.577235] smpboot: CPU 0 is now offline
> # nvme connect -t rdma -a 172.31.45.202 -s 4420 -n testnqn
> Failed to write to /dev/nvme-fabrics: Invalid cross-device link
> no controller found: failed to write to nvme-fabrics device
>
> # dmesg
> [ 781.577235] smpboot: CPU 0 is now offline
> [ 799.471627] nvme nvme0: creating 39 I/O queues.
> [ 801.053782] nvme nvme0: mapped 39/0/0 default/read/poll queues.
> [ 801.064149] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
> [ 801.073059] nvme nvme0: failed to connect queue: 1 ret=-18
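For reference, ret=-18 is -EXDEV, which is exactly the "Invalid
cross-device link" that the write to /dev/nvme-fabrics reports back to
nvme-cli. A quick standalone check of the errno mapping (plain
userspace C, nothing nvme-specific):

/* prints the errno value and string behind the failure above */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	printf("EXDEV = %d: %s\n", EXDEV, strerror(EXDEV));
	return 0;
}

On Linux this prints "EXDEV = 18: Invalid cross-device link", matching
both the dmesg line and the nvme-cli error.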
This is because of blk_mq_alloc_request_hctx(), and it has been raised
before. IIRC there was reluctance to make it allocate a request for an
hctx even if its associated mapped CPU is offline.
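To make that concrete, here is a toy userspace model of the check
(simplified from memory, not the actual blk-mq source, and
alloc_request_for_hctx() is just an illustrative name): when no online
CPU is left in the hw queue's cpumask, the allocation bails out with
-EXDEV, which is the ret=-18 seen above.

/*
 * Toy model (plain userspace C, NOT kernel code) of the offline-CPU
 * check in blk_mq_alloc_request_hctx(): if no online CPU intersects
 * the hw queue's cpumask, the allocation fails with -EXDEV.
 */
#include <errno.h>
#include <stdio.h>

/* toy cpumasks: bit N set means CPU N is in the mask */
static int alloc_request_for_hctx(unsigned long hctx_cpumask,
				  unsigned long online_mask)
{
	if ((hctx_cpumask & online_mask) == 0)
		return -EXDEV;	/* no online CPU mapped to this hctx */
	return 0;		/* real code would allocate on a mapped CPU's ctx */
}

int main(void)
{
	unsigned long hctx_cpumask = 1UL << 0;	/* hctx mapped to cpu0 only */
	unsigned long online = ~(1UL << 0);	/* cpu0 has been offlined */

	/* prints "ret = -18", i.e. -EXDEV */
	printf("ret = %d\n", alloc_request_for_hctx(hctx_cpumask, online));
	return 0;
}

In the real code the mask is the hctx cpumask intersected with
cpu_online_mask, but the effect is the same: the Connect command for
that queue cannot be allocated, so the whole connect fails.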
The latest attempt was from Ming:
[PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
Don't know where that went, though...