[PATCH] nvme-core: fix deadlock when reconnect failed due to nvme_set_queue_count timeout

Chao Leng lengchao at huawei.com
Wed Aug 5 05:42:16 EDT 2020



On 2020/8/5 16:22, Sagi Grimberg wrote:
> 
>>>> A deadlock happens when we test nvme over roce with link blink. The
>>>> reason: link blink triggers error recovery and then a reconnect. If the
>>>> reconnect fails due to an nvme_set_queue_count timeout, the reconnect
>>>> process sets the queue count to 0 and continues; nvme_start_ctrl then
>>>> calls nvme_enable_aen, and a deadlock happens because the admin queue
>>>> is quiesced.
>>>
>>> Why is the admin queue quiesced? If we are calling set_queue_count,
>>> it was already unquiesced?
>> An nvme_set_queue_count timeout will call nvme_rdma_teardown_admin_queue
> 
> Not in the patchset I sent.
Yes, the nvme_rdma_teardown_admin_queue call is already deleted in that
patchset. Still, the connect should be terminated instead of continuing,
because continuing cannot succeed; it is just an ineffective action.
> 
>> , and the admin queue
>> will be quiesced in nvme_rdma_teardown_admin_queue.
>>>
>>>> log:
>>>> Aug  3 22:47:24 localhost kernel: nvme nvme2: I/O 22 QID 0 timeout
>>>> Aug  3 22:47:24 localhost kernel: nvme nvme2: Could not set queue count
>>>> (881)
>>>> stack:
>>>> root     23848  0.0  0.0      0     0 ?        D    Aug03   0:00
>>>> [kworker/u12:4+nvme-wq]
>>>> [<0>] blk_execute_rq+0x69/0xa0
>>>> [<0>] __nvme_submit_sync_cmd+0xaf/0x1b0 [nvme_core]
>>>> [<0>] nvme_features+0x73/0xb0 [nvme_core]
>>>> [<0>] nvme_start_ctrl+0xa4/0x100 [nvme_core]
>>>> [<0>] nvme_rdma_setup_ctrl+0x438/0x700 [nvme_rdma]
>>>> [<0>] nvme_rdma_reconnect_ctrl_work+0x22/0x30 [nvme_rdma]
>>>> [<0>] process_one_work+0x1a7/0x370
>>>> [<0>] worker_thread+0x30/0x380
>>>> [<0>] kthread+0x112/0x130
>>>> [<0>] ret_from_fork+0x35/0x40
>>>>
>>>> Many functions that call __nvme_submit_sync_cmd treat the error code in
>>>> two ways: if the error code is less than 0, treat it as a failed command;
>>>> if it is greater than 0, treat it as the target not supporting it, or similar.
>>>
>>> We rely in a lot of places on the nvme status being returned from
>>> nvme_submit_sync_cmd (especially in nvme_revalidate_disk and for
>>> path/aborted cancellations), and this patch breaks it. You need to find
>>> a solution that does not hide the nvme status code from propagating
>>> back.
>> The difference is just EINTR versus EIO; there is no real impact.
> 
> It's not EIO, its propagating back the nvme status. And we need the
> nvme status back to not falsely remove namespaces when we have
> ns scanning during controller resets or network disconnects.
I see your point. I think we already falsely remove namespaces today when
the returned error is EAGAIN or EBUSY etc. Maybe we need to improve the
error code handling in nvme_revalidate_disk, like this:
---
  drivers/nvme/host/core.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 865645577f2c..d2a61798e9a1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2022,10 +2022,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
          * Only fail the function if we got a fatal error back from the
          * device, otherwise ignore the error and just move on.
          */
-       if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
-               ret = 0;
-       else if (ret > 0)
+       if (ret > 0 && (ret & NVME_SC_DNR))
                 ret = blk_status_to_errno(nvme_error_status(ret));
+       else if (ret != -ENODEV)
+               ret = 0;
         return ret;
  }

-- 
2.16.4
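
As a worked example of the rewritten check (illustration only, with the
constants as I read them from include/linux/nvme.h): the 881 in the log above
is 0x371, i.e. NVME_SC_HOST_ABORTED_CMD, which does not carry the DNR bit
(NVME_SC_DNR is 0x4000), so revalidation no longer fails and the namespace is
kept:

        int ret = 0x371;        /* 881 from the log: host aborted the command */

        if (ret > 0 && (ret & NVME_SC_DNR))     /* 0x371 & 0x4000 == 0, not taken */
                ret = blk_status_to_errno(nvme_error_status(ret));
        else if (ret != -ENODEV)
                ret = 0;                        /* taken: the namespace is kept */

Only a status with the DNR bit set, or -ENODEV, would still fail the
revalidation and lead to the namespace being removed.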

> 
> So as I said, you need to solve this issue without preventing the
> nvme status propagate back.
> .
We can solve this issue by checking the status, but that means modifying many
functions, which is ugly. To distinguish it from other error codes, we may need
to return EINTR for host-canceled requests (NVME_SC_HOST_ABORTED_CMD or
NVME_SC_HOST_PATH_ERROR), and the callers would need to treat EINTR as a
canceled request. That might be more appropriate.
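
As a rough sketch of that idea (the helper name is made up and this is not a
tested patch), the translation could sit where the sync command's result is
picked up, so every caller sees -EINTR for a host-canceled request and the
real NVMe status otherwise:

/*
 * Sketch: translate host-side cancellation into -EINTR so callers of
 * __nvme_submit_sync_cmd() can tell "request canceled by error recovery"
 * apart from a status returned by the target.
 */
static int nvme_sync_cmd_result_sketch(struct request *req)
{
        u16 status = nvme_req(req)->status;

        if (status == NVME_SC_HOST_ABORTED_CMD ||
            status == NVME_SC_HOST_PATH_ERROR)
                return -EINTR;  /* canceled locally: the caller should give up */

        return status;          /* 0 on success, NVMe status code otherwise */
}

Callers that rely on the NVMe status (e.g. nvme_revalidate_disk) would still
get it, and would only need to treat -EINTR as "canceled, do not remove the
namespace".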
