[PATCH] nvme-core: fix deadlock when reconnect failed due to nvme_set_queue_count timeout

Sagi Grimberg sagi at grimberg.me
Wed Aug 5 04:22:22 EDT 2020


>>> A deadlock happens when we test NVMe over RoCE with link blink. The
>>> reason: a link blink triggers error recovery and then a reconnect. If
>>> the reconnect fails due to an nvme_set_queue_count timeout, the
>>> reconnect process sets the queue count to 0 and continues; then
>>> nvme_start_ctrl calls nvme_enable_aen, and a deadlock happens because
>>> the admin queue is quiesced.
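
(For context, the error handling in nvme_set_queue_count follows roughly
this pattern -- a paraphrased sketch from memory, not the exact core.c
source -- where a positive NVMe status is logged and swallowed so that a
degraded controller can still be brought up with just its admin queue:)

	int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
	{
		u32 q_count = (*count - 1) | ((*count - 1) << 16);
		u32 result;
		int status, nr_io_queues;

		status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
				NULL, 0, &result);
		if (status < 0)
			return status;		/* submission/transport error */

		if (status > 0) {
			/* NVMe status from the controller: log it, fall back
			 * to 0 I/O queues and let the (re)connect keep going */
			dev_err(ctrl->device,
				"Could not set queue count (%d)\n", status);
			*count = 0;
			return 0;
		}

		nr_io_queues = min(result & 0xffff, result >> 16) + 1;
		*count = min(*count, nr_io_queues);
		return 0;
	}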
>>
>> Why is the admin queue quiesced? If we are calling set_queue_count,
>> wasn't it already unquiesced?
> An nvme_set_queue_count timeout will call nvme_rdma_teardown_admin_queue

Not in the patchset I sent.

> , and the admin queue will be quiesced in
> nvme_rdma_teardown_admin_queue.
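
(If the transport teardown really does quiesce the admin queue and only
unquiesces it on the remove path, the asymmetry being described would
look something like the sketch below. This is purely illustrative and
not taken from any particular tree:)

	static void teardown_admin_queue(struct nvme_rdma_ctrl *ctrl, bool remove)
	{
		blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
		nvme_rdma_stop_queue(&ctrl->queues[0]);
		/* ... cancel outstanding admin requests ... */
		if (remove)
			blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
		/*
		 * On the reconnect (!remove) path the admin queue stays
		 * quiesced, so a later sync command issued from
		 * nvme_start_ctrl -> nvme_enable_aen never gets dispatched
		 * and blk_execute_rq() waits forever.
		 */
	}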
>>
>>> log:
>>> Aug  3 22:47:24 localhost kernel: nvme nvme2: I/O 22 QID 0 timeout
>>> Aug  3 22:47:24 localhost kernel: nvme nvme2: Could not set queue count
>>> (881)
>>> stack:
>>> root     23848  0.0  0.0      0     0 ?        D    Aug03   0:00
>>> [kworker/u12:4+nvme-wq]
>>> [<0>] blk_execute_rq+0x69/0xa0
>>> [<0>] __nvme_submit_sync_cmd+0xaf/0x1b0 [nvme_core]
>>> [<0>] nvme_features+0x73/0xb0 [nvme_core]
>>> [<0>] nvme_start_ctrl+0xa4/0x100 [nvme_core]
>>> [<0>] nvme_rdma_setup_ctrl+0x438/0x700 [nvme_rdma]
>>> [<0>] nvme_rdma_reconnect_ctrl_work+0x22/0x30 [nvme_rdma]
>>> [<0>] process_one_work+0x1a7/0x370
>>> [<0>] worker_thread+0x30/0x380
>>> [<0>] kthread+0x112/0x130
>>> [<0>] ret_from_fork+0x35/0x40
>>>
>>> Many functions that call __nvme_submit_sync_cmd treat the error code in
>>> two ways: if the error code is less than 0, it is treated as a command
>>> failure; if the error code is greater than 0, it is treated as "target
>>> does not support" or similar.
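
(Put differently, the convention is: a negative return is a Linux errno
from the submission path, a positive return is the NVMe status word from
the completion. A hypothetical caller, just to illustrate the two modes:)

	/* example_check_feature() is a made-up helper, only to show the
	 * two return-value modes seen by __nvme_submit_sync_cmd callers */
	static int example_check_feature(struct nvme_ctrl *ctrl)
	{
		int ret;

		ret = nvme_set_features(ctrl, NVME_FEAT_HOST_BEHAVIOR, 0,
				NULL, 0, NULL);
		if (ret < 0)
			return ret;	/* command never completed: hard failure */
		if (ret > 0)
			return 0;	/* controller status, e.g. unsupported: not fatal */
		return 0;
	}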
>>
>> We rely in a lot of places on the nvme status being returned from
>> nvme_submit_sync_cmd (especially in nvme_revalidate_disk and for
>> path/aborted cancellations), and this patch breaks that. You need to
>> find a solution that does not prevent the nvme status code from
>> propagating back.
> The difference is just EINTR vs. EIO; there is no real impact.

It's not EIO, it's propagating back the nvme status. And we need the
nvme status back so that we do not falsely remove namespaces when ns
scanning runs during controller resets or network disconnects.

So, as I said, you need to solve this issue without preventing the nvme
status from propagating back.
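
(For reference, the kind of check that depends on this is the error
filtering at the end of nvme_revalidate_disk -- paraphrased from memory,
the actual tree may differ slightly:)

	/*
	 * Paraphrased: only treat the result as fatal if the controller
	 * said so (DNR set); an aborted/path-error status seen during a
	 * reset or disconnect must not cause the namespace to be removed.
	 */
	if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
		ret = 0;
	else if (ret > 0)
		ret = blk_status_to_errno(nvme_error_status(ret));

If every positive status gets flattened into -EINTR before it reaches
this point, the check can no longer tell an aborted command from a real
failure, and the scan work would drop perfectly good namespaces.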


