[RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node

Nilay Shroff nilay at linux.ibm.com
Fri Apr 18 03:45:55 PDT 2025



On 4/9/25 4:13 PM, Christoph Hellwig wrote:
> On Tue, Apr 08, 2025 at 07:37:48PM +0530, Nilay Shroff wrote:
>>>> +	 * For non-fabric controllers we support delayed removal of head disk
>>>> +	 * node. If we reached up to here then it means that head disk is still
>>>> +	 * alive and so we assume here that even if there's no path available
>>>> +	 * maybe due to the transient link failure, we could queue up the IO
>>>> +	 * and later when path becomes ready we re-submit queued IO.
>>>> +	 */
>>>> +	if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
>>>> +		return true;
>>>
>>> Why is this conditional on fabrics or not?  The same rationale should
>>> apply as much if not more for fabrics controllers.
>>>
>> For fabrics we already have options like "reconnect_delay" and 
>> "max_reconnects". So in case of fabric link failures, we delay 
>> the removal of the head disk node based on those options.
> 
> Yes.  But having entirely different behavior for creating a multipath
> node and removing it still seems like a rather bad idea.
> 
Yes, agreed. However, as you know, for NVMeoF the connection is retried (for
instance when a fabric link goes down) at the respective fabric transport
driver layer. For NVMe over PCIe, on the other hand, our proposal is to handle
the delayed removal of the multipath head node in the NVMe stacked (multipath)
driver layer. If we want to unify this behavior across PCIe and fabric
controllers, we could allow setting the delayed-removal option for fabric
connections as well. However, since fabrics already supports this behavior at
the transport layer (by virtue of the reconnect_delay and max_reconnects
options), IMO it may not be worthwhile to add this (delayed removal of the
multipath head node) support for fabric connections.
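
For reference, here is a simplified sketch (my own illustration, not the
actual transport code) of how a fabrics transport typically decides between
scheduling another reconnect attempt and deleting the controller, based on
reconnect_delay and max_reconnects; "connect_work" below stands in for the
transport-specific reconnect work item:

/*
 * Simplified sketch, not the actual transport code: decide whether to
 * retry the connection later or tear the controller down for good.
 */
static void nvmf_reconnect_or_remove_sketch(struct nvme_ctrl *ctrl,
		struct delayed_work *connect_work)
{
	if (ctrl->opts->max_reconnects == -1 ||
	    ctrl->nr_reconnects < ctrl->opts->max_reconnects) {
		/* keep the controller (and its paths) around and retry later */
		queue_delayed_work(nvme_wq, connect_work,
				   ctrl->opts->reconnect_delay * HZ);
	} else {
		/* retries exhausted: remove the controller for good */
		nvme_delete_ctrl(ctrl);
	}
}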

If the currently proposed implementation seems too conditional on fabrics, we
could change it so that nvme_available_path() carries no explicit fabrics
check; it could then be rewritten as follows:

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0f54889bd483..47a3723605f6 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -428,16 +428,6 @@ static bool nvme_available_path(struct nvme_ns_head *head)
        if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
                return NULL;
 
-       /*
-        * For non-fabric controllers we support delayed removal of head disk
-        * node. If we reached up to here then it means that head disk is still
-        * alive and so we assume here that even if there's no path available
-        * maybe due to the transient link failure, we could queue up the IO
-        * and later when path becomes ready we re-submit queued IO.
-        */
-       if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
-               return true;
-
        list_for_each_entry_srcu(ns, &head->list, siblings,
                                 srcu_read_lock_held(&head->srcu)) {
                if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
@@ -451,6 +441,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
                        break;
                }
        }
+
+       if (head->delayed_shutdown_sec)
+               return true;
+
        return false;
 }
 static void nvme_ns_head_submit_bio(struct bio *bio)


Please note that head->delayed_shutdown_sec is exported to user space via
sysfs. Its default value is zero, so delayed removal of the head node is
allowed only when head->delayed_shutdown_sec is set to a non-zero value.
Moreover, as discussed above, we may not want to export this option to user
space when the head node refers to a fabric connection, so this option would
be available exclusively to non-fabric (PCIe) multipath connections.
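
For clarity, below is a minimal sketch of what such a sysfs attribute could
look like; the helper names and the unsigned int type of delayed_shutdown_sec
are illustrative assumptions based on this RFC, not the exact patch:

/*
 * Minimal sketch, not the exact RFC patch: a per-head sysfs attribute
 * "delayed_shutdown_sec" that defaults to 0 (delayed removal disabled)
 * and is hidden for fabrics head nodes.
 */
static ssize_t delayed_shutdown_sec_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct nvme_ns_head *head = dev_to_disk(dev)->private_data;

	return sysfs_emit(buf, "%u\n", head->delayed_shutdown_sec);
}

static ssize_t delayed_shutdown_sec_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t count)
{
	struct nvme_ns_head *head = dev_to_disk(dev)->private_data;
	unsigned int sec;
	int ret;

	ret = kstrtouint(buf, 0, &sec);
	if (ret < 0)
		return ret;

	head->delayed_shutdown_sec = sec;
	return count;
}
static DEVICE_ATTR_RW(delayed_shutdown_sec);

/* hide the attribute when the head node refers to a fabrics connection */
static umode_t nvme_ns_head_delayed_attr_visible(struct kobject *kobj,
		struct attribute *a, int n)
{
	struct device *dev = container_of(kobj, struct device, kobj);
	struct nvme_ns_head *head = dev_to_disk(dev)->private_data;

	if (a == &dev_attr_delayed_shutdown_sec.attr &&
	    test_bit(NVME_NSHEAD_FABRICS, &head->flags))
		return 0;
	return a->mode;
}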

Thanks,
--Nilay


