[PATCH] nvme-multipath: fix possible hang in live ns resize with ANA access

Hannes Reinecke hare at suse.de
Thu Sep 29 23:29:54 PDT 2022


On 9/28/22 15:58, Sagi Grimberg wrote:
> When we revalidate paths as part of ns size change (as of commit e7d65803e2bb),
> it is possible that during the path revalidation, the only paths that is IO
> capable (i.e. optimized/non-optimized) are the ones that ns resize was not yet
> informed to the host, which will cause inflight requests to be requeued (as we
> have available paths but none are IO capable). These requests on the requeue
> list are waiting for someone to resubmit them at some point.
> 
> The IO capable paths will eventually notify the ns resize change to the
> host, but there is nothing that will kick the requeue list to resubmit the
> queued requests.
> 
> Fix this by always kicking the requeue list, and if no IO capable path exists,
> these requests will be queued again.
> 
> A typical log that indicates that IOs are requeued:
> --
> nvme nvme1: creating 4 I/O queues.
> nvme nvme1: new ctrl: "testnqn1"
> nvme nvme2: creating 4 I/O queues.
> nvme nvme2: mapped 4/0/0 default/read/poll queues.
> nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009
> nvme nvme1: rescanning namespaces.
> nvme1n1: detected capacity change from 2097152 to 4194304
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> block nvme1n1: no usable path - requeuing I/O
> nvme nvme2: rescanning namespaces.
> --
> 
> Reported-by: Yogev Cohen <yogev at lightbitslabs.com>
> Fixes: e7d65803e2bb ("nvme-multipath: revalidate paths during rescan")
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
> I was easily capable to reproduce this regression with a small debug patch to nvmet
> to have a 1 second delay between controller AENs:
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
>   #include "trace.h"
>   
>   #include "nvmet.h"
> +#include <linux/delay.h>
>   
>   struct workqueue_struct *buffered_io_wq;
>   struct workqueue_struct *zbd_wq;
> @@ -248,6 +249,7 @@ void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid)
>                  nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE,
>                                  NVME_AER_NOTICE_NS_CHANGED,
>                                  NVME_LOG_CHANGED_NS);
> +               msleep(1000);
>          }
>   }
> 
> And expose a subsystem via two ports, one ANA 'optimized', and one ANA 'inaccessible'.
> 
> Then test file-backed ns resize duing I/O:
> truncate --size=2G /tmp/f && echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn1/namespaces/1/revalidate_size
> 
>   drivers/nvme/host/multipath.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 6ef497c75a16..d532a78f24a2 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -172,16 +172,18 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>   void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>   {
>   	struct nvme_ns_head *head = ns->head;
> +	struct nvme_ns *n;
>   	sector_t capacity = get_capacity(head->disk);
>   	int node;
>   
> -	list_for_each_entry_rcu(ns, &head->list, siblings) {
> -		if (capacity != get_capacity(ns->disk))
> -			clear_bit(NVME_NS_READY, &ns->flags);
> +	list_for_each_entry_rcu(n, &head->list, siblings) {
> +		if (capacity != get_capacity(n->disk))
> +			clear_bit(NVME_NS_READY, &n->flags);
>   	}
>   
>   	for_each_node(node)
>   		rcu_assign_pointer(head->current_path[node], NULL);
> +	nvme_kick_requeue_lists(ns->ctrl);
>   }
>   
>   static bool nvme_path_is_disabled(struct nvme_ns *ns)

Hmm. I guess the real issue is that we use 'ns' as both function 
argument _and_ iterator, so that after 'list_for_each_rcu()' 'ns' is 
pointing to a _different_ namespace, such that we end up kicking that 
namespace and not the one used as argument.
But in either case, patch is fine.

Reviewed-by: Hannes Reinecke <hare at suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




More information about the Linux-nvme mailing list