[PATCHv3 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info

Nilay Shroff nilay at linux.ibm.com
Mon Sep 9 23:44:08 PDT 2024



On 9/9/24 18:10, Daniel Wagner wrote:
> On Tue, Sep 03, 2024 at 07:22:19PM GMT, Nilay Shroff wrote:
>> NVMe native multipath supports different I/O policies for selecting the
>> I/O path; however, we don't have any visibility into which path the
>> multipath code selects for forwarding I/O.
>> This patch adds that visibility through new sysfs attribute files named
>> "numa_nodes" and "queue_depth" under each namespace block device path
>> /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory under
>> the head disk node and, from this directory, add a link to each namespace
>> path device this head disk node points to.
>>
>> For instance, the /sys/block/nvmeXnY/multipath/ directory would contain a
>> soft link to each path the head disk node <nvmeXnY> points to:
>>
>> $ ls -l /sys/block/nvme1n1/multipath/
>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>
>> For the round-robin I/O policy, we can easily infer from the above output
>> that an I/O workload targeted to nvme1n1 would toggle across the paths
>> nvme1c1n1 and nvme1c3n1.
>>
>> For the numa I/O policy, the "numa_nodes" attribute file shows the NUMA
>> nodes preferred by the respective block device path. The value is a
>> comma-delimited list of nodes or an A-B range of nodes.
>>
>> For the queue-depth I/O policy, the "queue_depth" attribute file shows the
>> number of active/in-flight I/O requests currently queued for each path.
> 
> As far as I can tell, this looks good to me.
> 
>> +static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
>> +		char *buf)
>> +{
>> +	int node;
>> +	nodemask_t numa_nodes;
>> +	struct nvme_ns *current_ns;
>> +	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>> +	struct nvme_ns_head *head = ns->head;
>> +
>> +	nodes_clear(numa_nodes);
>> +
>> +	for_each_node(node) {
>> +		current_ns = srcu_dereference(head->current_path[node],
>> +				&head->srcu);
> 
> Don't you need to use srcu_read_lock() first?
Yeah, I overlooked it. You're right, I need to use srcu_read_lock() here.
I will update it in the next revision of the patch.
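
For reference, the hunk with the SRCU read-side critical section added would
roughly look like the sketch below (same variables as above; the actual
change in the next revision may differ a bit):

	int node, srcu_idx;
	nodemask_t numa_nodes;
	struct nvme_ns *current_ns;
	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
	struct nvme_ns_head *head = ns->head;

	nodes_clear(numa_nodes);

	/* hold the SRCU read lock across the whole current_path[] walk */
	srcu_idx = srcu_read_lock(&head->srcu);
	for_each_node(node) {
		current_ns = srcu_dereference(head->current_path[node],
				&head->srcu);
		if (ns == current_ns)
			node_set(node, numa_nodes);
	}
	srcu_read_unlock(&head->srcu, srcu_idx);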

> 
>> +		if (ns == current_ns)
>> +			node_set(node, numa_nodes);
> 
> And if ns matches current_ns can't you break the loop?
I think no. Here we map each NUMA node to the namespace path preferred for
it. So, for instance, if the current_ns (nvmeXcYnZ) path is the preferred
path for I/O workloads running on NUMA nodes i and j, then we set the
respective node bits (i and j) in the nodemask (numa_nodes), which is a
bitmap.

So in practice, when we cat this file,
# cat /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes
0-1

we get nodes 0 and 1 as the output. That implies that an I/O workload
targeted to /dev/nvme1n1 and running on NUMA node 0 or 1 would prefer using
the path nvme1c0n1.

And similarly, when we cat this file,
# cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3

we get nodes 2 and 3 as the output. That implies that an I/O workload
targeted to /dev/nvme1n1 and running on NUMA node 2 or 3 would prefer using
the path nvme1c1n1.
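
To make that concrete with a hypothetical four-node layout (purely
illustrative, not taken from the patch): if head->current_path[0] and [1]
both point at nvme1c0n1 while [2] and [3] point at nvme1c1n1, then the show
routine for nvme1c0n1 has to set bits 0 and 1, so breaking out on the first
match would report "0" instead of "0-1". Assuming the bitmap-list printk
format is what ends up in the final sysfs_emit(), the collected mask renders
as the ranged output:

	nodemask_t numa_nodes = NODE_MASK_NONE;

	/* hypothetical result of walking current_path[] for nvme1c0n1 */
	node_set(0, numa_nodes);
	node_set(1, numa_nodes);

	/* "%*pbl" prints a bitmap as a ranged list, i.e. "0-1" here */
	return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));

The bitmap-list format naturally collapses consecutive nodes into an A-B
range, which matches the "comma-delimited list of nodes or an A-B range"
wording in the commit message.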

Hope that clarifies. Let me know if you still think this doesn't make sense.

> 
> Thanks,
> Daniel

Thanks,
--Nilay


