[PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy
Nilay Shroff
nilay at linux.ibm.com
Tue Dec 10 05:02:09 PST 2024
On 12/10/24 14:52, Daniel Wagner wrote:
> There is a typo in the subject.
Oh yes, I will fix it in the next version of the patch.
>
> On Wed, Oct 30, 2024 at 04:11:41PM +0530, Nilay Shroff wrote:
>> This patch adds NVMe native multipath visibility for the round-robin
>> io-policy. It creates a "multipath" sysfs directory under the head gendisk
>> device node directory and, from that "multipath" directory, adds a link to
>> each namespace path device the head node refers to.
>>
>> For instance, if we have a shared namespace accessible from two different
>> controllers/paths, then we create a soft link to each path device from the
>> head disk node, as shown below:
>>
>> $ ls -l /sys/block/nvme1n1/multipath/
>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>
>> In the above example, nvme1n1 is the head gendisk node created for a shared
>> namespace, and the namespace is accessible through the nvme1c1n1 and
>> nvme1c3n1 paths.
>>
>> For the round-robin I/O policy, we can easily infer from the above output
>> that an I/O workload targeted at nvme1n1 would alternate between the paths
>> nvme1c1n1 and nvme1c3n1.
>>
>> Signed-off-by: Nilay Shroff <nilay at linux.ibm.com>
>> ---
>> drivers/nvme/host/core.c | 3 ++
>> drivers/nvme/host/multipath.c | 81 +++++++++++++++++++++++++++++++++++
>> drivers/nvme/host/nvme.h | 18 ++++++--
>> drivers/nvme/host/sysfs.c | 14 ++++++
>> 4 files changed, 112 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 84cb859a911d..c923bd2c5468 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3940,6 +3940,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>
>> if (!nvme_ns_head_multipath(ns->head))
>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>> +
>> + nvme_mpath_remove_sysfs_link(ns);
>> +
>> del_gendisk(ns->disk);
>>
>> mutex_lock(&ns->ctrl->namespaces_lock);
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 6a15873055b9..43c87a946cc7 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -682,6 +682,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>> kblockd_schedule_work(&head->partition_scan_work);
>> }
>>
>> + nvme_mpath_add_sysfs_link(ns->head);
>> +
>> mutex_lock(&head->lock);
>> if (nvme_path_is_optimized(ns)) {
>> int node, srcu_idx;
>> @@ -764,6 +766,25 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>> if (nvme_state_is_live(ns->ana_state) &&
>> nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
>> nvme_mpath_set_live(ns);
>> + else {
>> + /*
>> + * Add a sysfs link from the multipath head gendisk node to the path
>> + * device gendisk node.
>> + * If the path's ANA state is live (i.e. optimized or non-optimized)
>> + * when we allocate the ns, the sysfs link is created from
>> + * nvme_mpath_set_live() and we never reach this branch. For any other
>> + * ANA state, nvme_mpath_set_live() is called only after the state
>> + * transitions to live, yet we still want the link from the head node
>> + * to the path device irrespective of the path's ANA state.
>> + * So if we get here, the path's ANA state is not live, but we create
>> + * the sysfs link to this path anyway, provided the head disk node has
>> + * already come alive.
>> + */
>> + if (test_bit(NVME_NSHEAD_DISK_LIVE, &ns->head->flags))
>> + nvme_mpath_add_sysfs_link(ns->head);
>> + }
>
> Remind me, why do we add the link here? Couldn't we add it earlier so
> we don't have to check the ctrl state? At least I read this as saying
> that we want to add the link independent of the ctrl state.
>
Yes, we always add the link earlier for the case where a path's ANA state is
LIVE (i.e. the ANA state is either optimized or non-optimized). However, there
is a scenario where we allocate a namespace path whose ANA state is anything
but LIVE, and in that case we fall through to the code path above.

Assume we create a shared namespace and attach it to two different
controllers. One path of this shared namespace advertises its ANA state as
non-LIVE and the other path advertises its ANA state as LIVE.

Now, if we first allocate the namespace path that advertises the non-LIVE ANA
state, the head disk node would not come live, and without the head disk node
we can't create the sysfs link to that namespace path. Later, when we allocate
the second path, which advertises the LIVE ANA state, the head disk node comes
live, and we can then create the sysfs links to both paths.

So it's important to check the state of the head disk node in the code path
above. The idea is that we always want to create the sysfs link from the head
disk node to each namespace path, irrespective of the path's ANA state.
However, we can do that only once at least one path advertises its ANA state
as LIVE, so that the corresponding head disk node comes live and we can then
create the sysfs links to all paths. In fact, this requirement was proposed by
Hannes [1], and so I implemented it in the v5 patchset.
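
For reference, the linking helper could look roughly like the sketch below.
This is a minimal sketch only, not the literal body from the patch (which is
elided from the quoted diff above): the use of sysfs_add_link_to_group(), the
assumption that a "multipath" attribute group is already registered on the
head disk, and the SRCU traversal shown here are my shorthand for this reply.

static void nvme_mpath_add_sysfs_link(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;
	int srcu_idx;

	/*
	 * Nothing to link against until the head gendisk is live; callers
	 * simply try again once nvme_mpath_set_live() runs for some path.
	 */
	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
		return;

	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(ns, &head->list, siblings) {
		struct device *target = disk_to_dev(ns->disk);

		/*
		 * Create <head disk>/multipath/<path disk> pointing at the
		 * path device; -EEXIST just means the link already exists.
		 */
		sysfs_add_link_to_group(&disk_to_dev(head->disk)->kobj,
					"multipath", &target->kobj,
					dev_name(target));
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
}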
> Or do you want to say that whenever the link exists, the path is usable?
> Trying to figure out whether this is a good idea or not: when
> nvme-cli/libnvme iterates over sysfs and races with this, what's the user
> experience?
>
If a link from the head node to a path exists, it doesn't always mean the path
is usable. We need to read the path's ANA state to determine whether the path
is usable or not.
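
In kernel terms, "usable" here just mirrors the existing nvme_state_is_live()
check in multipath.c, i.e. something along these lines (illustrative fragment
only; path_is_usable is a made-up name for this reply):

/*
 * Illustrative only: a path counts as usable for I/O when its ANA state
 * is live, i.e. either optimized or non-optimized (mirrors the existing
 * nvme_state_is_live() helper in drivers/nvme/host/multipath.c).
 */
static bool path_is_usable(struct nvme_ns *ns)
{
	return ns->ana_state == NVME_ANA_OPTIMIZED ||
	       ns->ana_state == NVME_ANA_NONOPTIMIZED;
}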
> I am sorry if this has already been discussed. I just completely lost
> the state :)
No worries :) Please let me know if you have any further comments/questions.
[1] https://lore.kernel.org/all/3c92b9ae-1223-4787-90e3-955f82161269@suse.de/
Thanks,
--Nilay