[RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node

Nilay Shroff nilay at linux.ibm.com
Mon Apr 28 00:05:39 PDT 2025



On 4/25/25 8:13 PM, Christoph Hellwig wrote:
> On Fri, Apr 25, 2025 at 04:03:08PM +0530, Nilay Shroff wrote:
>> @@ -4007,10 +4008,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>  
>>  	mutex_lock(&ns->ctrl->subsys->lock);
>>  	list_del_rcu(&ns->siblings);
>> -	if (list_empty(&ns->head->list)) {
>> -		list_del_init(&ns->head->entry);
>> -		last_path = true;
>> -	}
>>  	mutex_unlock(&ns->ctrl->subsys->lock);
>>  
>>  	/* guarantee not available in head->list */
>> @@ -4028,8 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>  	mutex_unlock(&ns->ctrl->namespaces_lock);
>>  	synchronize_srcu(&ns->ctrl->srcu);
>>  
>> -	if (last_path)
>> -		nvme_mpath_shutdown_disk(ns->head);
>> +	nvme_mpath_shutdown_disk(ns->head);
> 
> This now removes the head list deletion from the first critical
> section into nvme_mpath_shutdown_disk.  I remember we had to do it
> the way it currently is done because we were running into issues
> otherwise, the commit history might help a bit with what the issues
> were.
> 
Thank you for the pointers! Yes, I checked the commit history and found
commit 9edceaf43050 ("nvme: avoid race in shutdown namespace removal"), which
avoids the race between nvme_find_ns_head and nvme_ns_remove. Looking at the
commit history, it seems we also need to handle that case here. I think if the
user doesn't configure delayed removal of the head node, then the existing
behavior should be preserved to avoid the above-mentioned race. However, if the
user configures delayed head node removal, we would defer the head->entry
removal until the configured delay elapses. I'll handle this case in the next patch.
 
>> +	if (a == &dev_attr_delayed_removal_secs.attr) {
>> +		struct nvme_ns_head *head = dev_to_ns_head(dev);
>> +		struct gendisk *disk = dev_to_disk(dev);
>> +
>> +		/*
>> +		 * This attribute is only valid for head node and non-fabric
>> +		 * setup.
>> +		 */
>> +		if (!nvme_disk_is_ns_head(disk) ||
>> +				test_bit(NVME_NSHEAD_FABRICS, &head->flags))
>> +			return 0;
>> +	}
> 
> Didn't we say we also want to allow fabrics to use it if explicitly
> configured?
> 
Hmm, I think there's some confusion here. I thought we discussed that for a fabric setup, 
fabric link failure is managed through other parameters such as "reconnect_delay" 
and "max_reconnects", which are handled at the respective fabric driver layer. Therefore, 
head->delayed_removal_secs is intended exclusively for non-fabric (e.g., PCIe) 
multipath configurations.

Even if we were to support delayed head node removal for a fabric setup, I 
wonder who would handle the fabric (re-)connect? Once we exhaust the number of 
reconnect attempts, we stop the reconnect work and delete the fabric controller.

So do you think we should still expose head->delayed_removal_secs for fabric setups? 
If yes, it would add a further delay (without any reconnect attempts) after 
fabric link failures, wouldn't it?

Thanks,
--Nilay

