[PATCH] nvme: don't wait freeze during resetting

Wed Sep 21 01:19:21 PDT 2022

On 9/21/22 04:25, Ming Lei wrote:
> On Tue, Sep 20, 2022 at 11:18:33AM +0300, Sagi Grimberg wrote:
>>
>>> First it isn't necessary to call nvme_wait_freeze during reset.
>>> For nvme-pci, if tagset isn't allocated, there can't be any inflight
>>> IOs; otherwise blk_mq_update_nr_hw_queues can freeze & wait queues.
>>>
>>> Second, since commit bdd6316094e0 ("block: Allow unfreezing of a queue
>>> while requests are in progress"), it is fine to unfreeze queue without
>>> draining inflight IOs.
>>>
>>> Also both nvme-rdma and nvme-tcp's timeout handler provides forward
>>> progress if the controller state isn't LIVE, so it is fine to drop
>>> the timeout function of nvme_wait_freeze_timeout().
>>
>> The rdma/tcp should probably be split to separate patches.
>>
>>>
>>> Cc: Sagi Grimberg <sagi at grimberg.me>
>>> Cc: Chao Leng <lengchao at huawei.com>
>>> Cc: Keith Busch <kbusch at kernel.org>
>>> Signed-off-by: Ming Lei <ming.lei at redhat.com>
>>> ---
>>>    drivers/nvme/host/apple.c |  1 -
>>>    drivers/nvme/host/pci.c   |  1 -
>>>    drivers/nvme/host/rdma.c  | 13 -------------
>>>    drivers/nvme/host/tcp.c   | 13 -------------
>>>    4 files changed, 28 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
>>> index 5fc5ea196b40..9cd02b57fc85 100644
>>> --- a/drivers/nvme/host/apple.c
>>> +++ b/drivers/nvme/host/apple.c
>>> @@ -1126,7 +1126,6 @@ static void apple_nvme_reset_work(struct work_struct *work)
>>>    	anv->ctrl.queue_count = nr_io_queues + 1;
>>>    	nvme_start_queues(&anv->ctrl);
>>> -	nvme_wait_freeze(&anv->ctrl);
>>>    	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
>>>    	nvme_unfreeze(&anv->ctrl);
>>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>> index 98864b853eef..985b216907fc 100644
>>> --- a/drivers/nvme/host/pci.c
>>> +++ b/drivers/nvme/host/pci.c
>>> @@ -2910,7 +2910,6 @@ static void nvme_reset_work(struct work_struct *work)
>>>    		nvme_free_tagset(dev);
>>>    	} else {
>>>    		nvme_start_queues(&dev->ctrl);
>>> -		nvme_wait_freeze(&dev->ctrl);
>>>    		if (!dev->ctrl.tagset)
>>>    			nvme_pci_alloc_tag_set(dev);
>>>    		else
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index 3100643be299..beb0d1a6a84d 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -986,15 +986,6 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>>>    	if (!new) {
>>>    		nvme_start_queues(&ctrl->ctrl);
>>> -		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
>>> -			/*
>>> -			 * If we timed out waiting for freeze we are likely to
>>> -			 * be stuck.  Fail the controller initialization just
>>> -			 * to be safe.
>>> -			 */
>>> -			ret = -ENODEV;
>>> -			goto out_wait_freeze_timed_out;
>>> -		}
>>
>> So here is the description from the patch that introduced this:
>> --
>> nvme-rdma: fix reset hang if controller died in the middle of a reset
>>
>> If the controller becomes unresponsive in the middle of a reset, we
>> will hang because we are waiting for the freeze to complete, but that
>> cannot happen since we have commands that are inflight holding the
>> q_usage_counter, and we can't blindly fail requests that times out.
>>
>> So give a timeout and if we cannot wait for queue freeze before
>> unfreezing, fail and have the error handling take care how to
>> proceed (either schedule a reconnect of remove the controller).
>> --
>>
>> So if between nvme_start_queues() and the freeze (with a full wait)
>> that is done in blk_mq_update_nr_hw_queues() the controller becomes
>> non responsive, in this case we may hang blocking on I/O that was
>> pending and requeued after nvme_start_queues().
>>
>> The problem is, that we cannot do any error recovery because the
>> controller is in the middle of a reset/reconnect...
>> So the code that you deleted was designed to detect this state, and
>> reschedule another reconnect if the controller became non responsive.
>>
>> What is preventing this from happening now?
> 
> Please see nvme_rdma_timeout() & nvme_tcp_timeout(), if controller state
> isn't live, request will be aborted.

I agree with you. However, non-mpath devices will most likely retry the
command rather than fail it as in the multipath case (see
nvme_decide_disposition()), which will cause the I/O to block.
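
To make the distinction concrete, here is a minimal userspace sketch of
the retry-versus-failover decision. It is a simplified model, not the
kernel's nvme_decide_disposition(): struct model_req, its fields and
model_decide() are hypothetical names introduced only for illustration,
and the real function checks more conditions than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Possible outcomes for a finished request in this model. */
enum disposition { COMPLETE, RETRY, FAILOVER };

/* Hypothetical stand-in for the per-request state the decision looks at. */
struct model_req {
	bool failed;      /* request finished with an error status */
	bool path_error;  /* error is attributed to the path/transport */
	bool mpath;       /* request was issued through a multipath device */
	bool no_retry;    /* retries exhausted or explicitly disallowed */
	bool queue_dying; /* controller is being removed */
};

static enum disposition model_decide(const struct model_req *req)
{
	assert(req != NULL);

	/* Successful or non-retryable requests are simply completed. */
	if (!req->failed || req->no_retry)
		return COMPLETE;

	if (req->mpath) {
		/* Multipath: hand the I/O to another path, ending this request. */
		if (req->path_error || req->queue_dying)
			return FAILOVER;
	} else if (req->queue_dying) {
		/* Non-multipath: only a dying queue completes with an error. */
		return COMPLETE;
	}

	/* Everything else is requeued and stays pending on this queue. */
	return RETRY;
}
```

In this model, a request aborted by the timeout handler with a path
error yields FAILOVER when mpath is set and RETRY when it is not, which
is the case that keeps the I/O pending.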

While it is arguable whether non-mpath fabrics devices are important in
any capacity, the design was that I/O is not completed until the
controller either successfully reconnects (and the I/O is retried),
disconnects (and the I/O is failed), or fast_io_fail_tmo expires.

Hence for non-mpath controllers, the request(s) will time out and be
aborted, but nvme will opt to retry them instead of completing them
with a failure (at least until fast_io_fail_tmo expires, but that can
be arbitrarily long).
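
The consequence for a freeze wait can be sketched with a toy model. It
assumes that a requeued request keeps its reference on the queue (the
q_usage_counter mentioned in the quoted commit message) while a
failed-over request drops it. All names below (struct model_queue,
model_timeout_tick(), model_wait_freeze_timeout()) are hypothetical, and
the "tick" is an abstract timeout period, not a kernel time unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a request queue during a reset/reconnect. */
struct model_queue {
	int usage;            /* models q_usage_counter: requests holding the queue */
	bool ctrl_responsive; /* whether the controller completes I/O */
	bool mpath;           /* aborted requests fail over instead of retrying */
};

/* One timeout period: every in-flight request times out and is aborted. */
static void model_timeout_tick(struct model_queue *q)
{
	assert(q != NULL);

	/* Requests complete normally, or are ended by failover to another path. */
	if (q->ctrl_responsive || q->mpath)
		q->usage = 0;
	/* Otherwise aborted requests are requeued and keep their reference. */
}

/*
 * Bounded freeze wait: returns true if the queue drained within max_ticks
 * timeout periods, false if requests are still pending afterwards.
 */
static bool model_wait_freeze_timeout(struct model_queue *q, int max_ticks)
{
	assert(q != NULL);

	for (int i = 0; i < max_ticks && q->usage > 0; i++)
		model_timeout_tick(q);

	return q->usage == 0;
}
```

With usage = 3, ctrl_responsive = false and mpath = false, the bounded
wait returns false no matter how large max_ticks is, so the caller can
bail out and schedule another reconnect. An unbounded wait in the same
state would never return.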