[PATCH] nvme: don't wait freeze during resetting

Sagi Grimberg sagi at grimberg.me
Tue Sep 20 01:18:33 PDT 2022


> First, it isn't necessary to call nvme_wait_freeze during reset.
> For nvme-pci, if the tagset isn't allocated there can't be any inflight
> IOs; otherwise blk_mq_update_nr_hw_queues can freeze & wait on the
> queues itself.
> 
> Second, since commit bdd6316094e0 ("block: Allow unfreezing of a queue
> while requests are in progress"), it is fine to unfreeze a queue without
> draining inflight IOs.
> 
> Also, both the nvme-rdma and nvme-tcp timeout handlers provide forward
> progress if the controller state isn't LIVE, so it is fine to drop
> the nvme_wait_freeze_timeout() call.

The rdma/tcp changes should probably be split into separate patches.
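
On the freeze point itself: the claim that blk_mq_update_nr_hw_queues()
freezes and waits on its own matches my reading of block/blk-mq.c.
Roughly (a paraphrase of __blk_mq_update_nr_hw_queues(), not the literal
code):

	#include <linux/blk-mq.h>

	static void update_nr_hw_queues_sketch(struct blk_mq_tag_set *set,
					       int nr_hw_queues)
	{
		struct request_queue *q;

		/* freeze every queue in the tagset and wait for inflight IO */
		list_for_each_entry(q, &set->tag_list, tag_set_list)
			blk_mq_freeze_queue(q);

		/* ... remap to nr_hw_queues, reallocate the maps ... */

		list_for_each_entry(q, &set->tag_list, tag_set_list)
			blk_mq_unfreeze_queue(q);
	}

So the freeze-and-wait doesn't go away, it just moves into the block
layer, and without a timeout, which is what worries me below.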

> 
> Cc: Sagi Grimberg <sagi at grimberg.me>
> Cc: Chao Leng <lengchao at huawei.com>
> Cc: Keith Busch <kbusch at kernel.org>
> Signed-off-by: Ming Lei <ming.lei at redhat.com>
> ---
>   drivers/nvme/host/apple.c |  1 -
>   drivers/nvme/host/pci.c   |  1 -
>   drivers/nvme/host/rdma.c  | 13 -------------
>   drivers/nvme/host/tcp.c   | 13 -------------
>   4 files changed, 28 deletions(-)
> 
> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
> index 5fc5ea196b40..9cd02b57fc85 100644
> --- a/drivers/nvme/host/apple.c
> +++ b/drivers/nvme/host/apple.c
> @@ -1126,7 +1126,6 @@ static void apple_nvme_reset_work(struct work_struct *work)
>   	anv->ctrl.queue_count = nr_io_queues + 1;
>   
>   	nvme_start_queues(&anv->ctrl);
> -	nvme_wait_freeze(&anv->ctrl);
>   	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
>   	nvme_unfreeze(&anv->ctrl);
>   
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 98864b853eef..985b216907fc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2910,7 +2910,6 @@ static void nvme_reset_work(struct work_struct *work)
>   		nvme_free_tagset(dev);
>   	} else {
>   		nvme_start_queues(&dev->ctrl);
> -		nvme_wait_freeze(&dev->ctrl);
>   		if (!dev->ctrl.tagset)
>   			nvme_pci_alloc_tag_set(dev);
>   		else
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 3100643be299..beb0d1a6a84d 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -986,15 +986,6 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>   
>   	if (!new) {
>   		nvme_start_queues(&ctrl->ctrl);
> -		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
> -			/*
> -			 * If we timed out waiting for freeze we are likely to
> -			 * be stuck.  Fail the controller initialization just
> -			 * to be safe.
> -			 */
> -			ret = -ENODEV;
> -			goto out_wait_freeze_timed_out;
> -		}

So here is the description from the patch that introduced this:
--
nvme-rdma: fix reset hang if controller died in the middle of a reset

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that time out.

So give a timeout, and if we cannot wait for the queue freeze before
unfreezing, fail and have the error handling take care of how to
proceed (either schedule a reconnect or remove the controller).
--
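
And the guard itself is just a bounded version of that wait; from memory,
nvme_wait_freeze_timeout() in drivers/nvme/host/core.c is roughly (names
as I recall them, not copied verbatim):

	#include "nvme.h"	/* struct nvme_ctrl, struct nvme_ns */

	static int nvme_wait_freeze_timeout_sketch(struct nvme_ctrl *ctrl,
						   long timeout)
	{
		struct nvme_ns *ns;

		down_read(&ctrl->namespaces_rwsem);
		list_for_each_entry(ns, &ctrl->namespaces, list) {
			/* returns the time left, or 0 if we gave up waiting */
			timeout = blk_mq_freeze_queue_wait_timeout(ns->queue,
								   timeout);
			if (timeout <= 0)
				break;
		}
		up_read(&ctrl->namespaces_rwsem);
		return timeout;
	}

so the deleted !nvme_wait_freeze_timeout() check is what turned "stuck
forever" into "fail and let the error handling reconnect or remove".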

So if the controller becomes non-responsive between nvme_start_queues()
and the freeze (with a full, untimed wait) that is done in
blk_mq_update_nr_hw_queues(), we may hang blocking on I/O that was
pending and requeued after nvme_start_queues().

The problem is that we cannot do any error recovery, because the
controller is in the middle of a reset/reconnect...
So the code that you deleted was designed to detect this state and
reschedule another reconnect if the controller became non-responsive.

What is preventing this from happening now?

I wish we had a test for this... It is very difficult to hit because the
controller needs to become non-responsive exactly at this point in the
reset...
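
To make the window concrete, the post-patch io-queues flow for !new
looks roughly like this (function name and annotations are mine, the
calls are the ones in nvme_rdma_configure_io_queues() above):

	static int configure_io_queues_sketch(struct nvme_rdma_ctrl *ctrl)
	{
		/* (queue setup before this point elided) */

		nvme_start_queues(&ctrl->ctrl);	/* requeued IO is dispatched again */

		/*
		 * If the controller dies right here, that IO never completes
		 * and keeps q_usage_counter elevated, and we cannot kick off
		 * error recovery because we are already inside the
		 * reset/reconnect, so ...
		 */

		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
					   ctrl->ctrl.queue_count - 1);
		/* ... the freeze-and-wait inside here never finishes */

		nvme_unfreeze(&ctrl->ctrl);
		return 0;
	}

With the old code, nvme_wait_freeze_timeout() sat in that window and
gave us a way out.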

>   		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
>   			ctrl->ctrl.queue_count - 1);
>   		nvme_unfreeze(&ctrl->ctrl);
> @@ -1002,10 +993,6 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>   
>   	return 0;
>   
> -out_wait_freeze_timed_out:
> -	nvme_stop_queues(&ctrl->ctrl);
> -	nvme_sync_io_queues(&ctrl->ctrl);
> -	nvme_rdma_stop_io_queues(ctrl);
>   out_cleanup_connect_q:
>   	nvme_cancel_tagset(&ctrl->ctrl);
>   	if (new)
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index d5871fd6f769..49d9bef806f9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1920,15 +1920,6 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
>   
>   	if (!new) {
>   		nvme_start_queues(ctrl);
> -		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
> -			/*
> -			 * If we timed out waiting for freeze we are likely to
> -			 * be stuck.  Fail the controller initialization just
> -			 * to be safe.
> -			 */
> -			ret = -ENODEV;
> -			goto out_wait_freeze_timed_out;
> -		}
>   		blk_mq_update_nr_hw_queues(ctrl->tagset,
>   			ctrl->queue_count - 1);
>   		nvme_unfreeze(ctrl);
> @@ -1936,10 +1927,6 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
>   
>   	return 0;
>   
> -out_wait_freeze_timed_out:
> -	nvme_stop_queues(ctrl);
> -	nvme_sync_io_queues(ctrl);
> -	nvme_tcp_stop_io_queues(ctrl);
>   out_cleanup_connect_q:
>   	nvme_cancel_tagset(ctrl);
>   	if (new)


