[PATCH 2/2] nvme: don't freeze/unfreeze queues from different contexts

Tue Jun 13 06:13:42 PDT 2023

> For block layer freeze/unfreeze APIs, the caller is required to call the
> two in strict pair, so most of users simply call them from same context,
> and everything works just fine.
> 
> For NVMe, the two are done from different contexts, this way has caused
> all kinds of IO hang issue, such as:
> 
> 1) When io queue connect fails, this controller is deleted without being
> marked as DEAD. Upper layer may wait forever in __bio_queue_enter(), because
> in del_gendisk(), disk won't be marked as DEAD unless bdev sync & invalidate
> returns. If any writeback IO waits in __bio_queue_enter(), IO deadlock is
> caused. Reported from Yi Zhang.
> 
> 2) error recovering vs. namespace deletiong, if any IO originated from
> scan work is waited in __bio_queue_enter(), flushing scan work hangs for
> ever in nvme_remove_namespaces() because controller is left as frozen
> when error recovery is interrupted by controller removal. Reported from
> Chunguang.
> 
> Fix the issue by calling the two in same context just when reset is done
> and not starting freeze from beginning of error recovery. Not only IO hang
> is solved, correctness of freeze & unfreeze is respected.
> 
> And this way is correct because quiesce is enough for driver to handle
> error recovery. The only difference is where to wait during error recovery.
> With this way, IO is just queued in block layer queue instead of
> __bio_queue_enter(), finally waiting for completion is done in upper
> layer. Either way, IO can't move on during error recovery.
> 
> Reported-by: Chunguang Xu <brookxu.cn at gmail.com>
> Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/
> Reported-by: Yi Zhang <yi.zhang at redhat.com>
> Signed-off-by: Ming Lei <ming.lei at redhat.com>
> ---
>   drivers/nvme/host/core.c | 4 +---
>   drivers/nvme/host/pci.c  | 8 +++++---
>   drivers/nvme/host/rdma.c | 3 ++-
>   drivers/nvme/host/tcp.c  | 3 ++-
>   4 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 4ef5eaecaa75..d5d9b6f6ec74 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4707,10 +4707,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>   	 * removing the namespaces' disks; fail all the queues now to avoid
>   	 * potentially having to clean up the failed sync later.
>   	 */
> -	if (ctrl->state == NVME_CTRL_DEAD) {
> +	if (ctrl->state == NVME_CTRL_DEAD)
>   		nvme_mark_namespaces_dead(ctrl);
> -		nvme_unquiesce_io_queues(ctrl);
> -	}

Shouldn't this be in the next patch? Not sure what
this helps in this patch? It is not clearly documented
in the commit msg.

>   
>   	/* this is a no-op when called from the controller reset handler */
>   	nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 492f319ebdf3..5d775b76baca 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2578,14 +2578,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>   	dead = nvme_pci_ctrl_is_dead(dev);
>   	if (dev->ctrl.state == NVME_CTRL_LIVE ||
>   	    dev->ctrl.state == NVME_CTRL_RESETTING) {
> -		if (pci_is_enabled(pdev))
> -			nvme_start_freeze(&dev->ctrl);
>   		/*
>   		 * Give the controller a chance to complete all entered requests
>   		 * if doing a safe shutdown.
>   		 */
> -		if (!dead && shutdown)
> +		if (!dead && shutdown & pci_is_enabled(pdev)) {
> +			nvme_start_freeze(&dev->ctrl);
>   			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> +			nvme_unfreeze(&dev->ctrl);
> +		}

I'd split out the pci portion, it is not related to the reported issue,
and it is structured differently than the fabrics transports (as for now
at least).

>   	}
>   
>   	nvme_quiesce_io_queues(&dev->ctrl);
> @@ -2740,6 +2741,7 @@ static void nvme_reset_work(struct work_struct *work)
>   	 * controller around but remove all namespaces.
>   	 */
>   	if (dev->online_queues > 1) {
> +		nvme_start_freeze(&dev->ctrl);
>   		nvme_unquiesce_io_queues(&dev->ctrl);
>   		nvme_wait_freeze(&dev->ctrl);
>   		nvme_pci_update_nr_queues(dev);
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 0eb79696fb73..354cce8853c1 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -918,6 +918,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>   		goto out_cleanup_tagset;
>   
>   	if (!new) {
> +		nvme_start_freeze(&ctrl->ctrl);
>   		nvme_unquiesce_io_queues(&ctrl->ctrl);
>   		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
>   			/*
> @@ -926,6 +927,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>   			 * to be safe.
>   			 */
>   			ret = -ENODEV;
> +			nvme_unfreeze(&ctrl->ctrl);

What does this unfreeze designed to do?