[PATCH v2] nvme: Fix regression when disconnect a recovering ctrl

Mon Jun 27 20:01:18 PDT 2022

On 2022/6/28 7:42, Chaitanya Kulkarni wrote:

> On 6/22/22 23:45, Ruozhu Li wrote:
>> We encountered a problem that the disconnect command hangs.
>> After analyzing the log and stack, we found that the triggering
>> process is as follows:
>> CPU0                          CPU1
>>                                   nvme_rdma_error_recovery_work
>>                                     nvme_rdma_teardown_io_queues
>> nvme_do_delete_ctrl                 nvme_stop_queues
>>     nvme_remove_namespaces
>>     --clear ctrl->namespaces
>>                                       nvme_start_queues
>>                                       --no ns in ctrl->namespaces
>>       nvme_ns_remove                  return(because ctrl is deleting)
>>         blk_freeze_queue
>>           blk_mq_freeze_queue_wait
>>           --wait for ns to unquiesce to clean inflight IO, hang forever
>>
>> This problem did not occur in older kernels because err work was
>> flushed in nvme_stop_ctrl before nvme_remove_namespaces. That behavior
>> does not seem to have been changed for functional reasons, so the patch
>> can be reverted to solve the problem.
>>
>> Revert commit 794a4cb3d2f7 ("nvme: remove the .stop_ctrl callout")
>>
> without looking into the code, do you have any idea if fc and/or loop
> transport also suffer from similar issue ?
>
> -ck
I am not very familiar with the code of these transports. It seems that FC
also stops and starts the queues in its err work, so it will probably have
a similar problem.

The loop transport only has reset work, and that work is flushed, so it
should not have this problem.
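
To make the ordering in the trace above easier to follow, here is a minimal
single-threaded sketch of it. This is a toy model, not kernel code: every
toy_* name is hypothetical, the namespace list is a plain array, and the two
CPUs are replaced by a fixed interleaving. It only shows why starting the
queues after the list has been cleared leaves the namespaces quiesced, and
why flushing the err work first avoids that.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NS 4

/* Hypothetical stand-in for a namespace: only tracks the quiesce state. */
struct toy_ns {
	bool quiesced;
};

/* Hypothetical stand-in for the controller and its ctrl->namespaces list. */
struct toy_ctrl {
	struct toy_ns *list[TOY_MAX_NS];
	size_t count;
};

/* Models nvme_stop_queues: quiesce every namespace still on the list. */
static void toy_stop_queues(struct toy_ctrl *ctrl)
{
	for (size_t i = 0; i < ctrl->count; i++)
		ctrl->list[i]->quiesced = true;
}

/* Models nvme_start_queues: unquiesce every namespace still on the list. */
static void toy_start_queues(struct toy_ctrl *ctrl)
{
	for (size_t i = 0; i < ctrl->count; i++)
		ctrl->list[i]->quiesced = false;
}

/*
 * Replays the interleaving from the trace and returns how many namespaces
 * are still quiesced when removal would freeze their queues. A non-zero
 * result corresponds to the hang in blk_mq_freeze_queue_wait.
 */
int toy_disconnect(bool flush_err_work_first)
{
	struct toy_ns ns[2] = { { false }, { false } };
	struct toy_ctrl ctrl = { { &ns[0], &ns[1] }, 2 };
	size_t detached;
	int stuck = 0;

	/* Error recovery work: teardown stops the queues. */
	toy_stop_queues(&ctrl);

	/* Old behavior: err work is flushed, so it finishes before removal. */
	if (flush_err_work_first)
		toy_start_queues(&ctrl);

	/* Delete path: the namespace list is cleared from the controller. */
	detached = ctrl.count;
	ctrl.count = 0;

	/* Regressed behavior: err work restarts queues on an empty list. */
	if (!flush_err_work_first)
		toy_start_queues(&ctrl);

	/* Removal would now freeze each queue; count the ones still quiesced. */
	for (size_t i = 0; i < detached; i++)
		if (ns[i].quiesced)
			stuck++;

	return stuck;
}
```

In this model, toy_disconnect(false) reports both namespaces as still
quiesced, which matches the hang in the trace, while toy_disconnect(true)
reports none.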