[PATCH v2] nvme: Fix regression when disconnect a recovering ctrl

Chaitanya Kulkarni chaitanyak at nvidia.com
Mon Jun 27 16:42:55 PDT 2022


On 6/22/22 23:45, Ruozhu Li wrote:
> We encountered a problem where the disconnect command hangs.
> After analyzing the log and stack, we found that the triggering
> process is as follows:
> CPU0                          CPU1
>                                  nvme_rdma_error_recovery_work
>                                    nvme_rdma_teardown_io_queues
> nvme_do_delete_ctrl                 nvme_stop_queues
>    nvme_remove_namespaces
>    --clear ctrl->namespaces
>                                      nvme_start_queues
>                                      --no ns in ctrl->namespaces
>      nvme_ns_remove                  return(because ctrl is deleting)
>        blk_freeze_queue
>          blk_mq_freeze_queue_wait
>          --wait for ns to unquiesce to clean inflight IO, hang forever
> 
> This problem did not occur in older kernels because they flushed the
> err work in nvme_stop_ctrl before nvme_remove_namespaces. That ordering
> does not seem to have been changed for functional reasons, so the
> patch can be reverted to solve the problem.
> 
> Revert commit 794a4cb3d2f7 ("nvme: remove the .stop_ctrl callout")
> 

Without looking into the code, do you have any idea whether the fc and/or
loop transports also suffer from a similar issue?

-ck



