[RFC PATCH v2 0/1] Add visibility for native NVMe multipath using sysfs

Nilay Shroff nilay at linux.ibm.com
Fri Aug 9 10:29:56 PDT 2024


Hi,

This patch proposes adding new sysfs attributes to provide visibility into
native multipath I/O. The previous version of this RFC [1] proposed using
debugfs for this purpose; however, the general feedback was to instead export
the multipath I/O information through sysfs attributes and then later parse
and format those attributes using libnvme/nvme-cli.

NVMe native multipath supports three different I/O policies (numa,
round-robin and queue-depth) for selecting the I/O path; however, we
currently have no visibility into which path the multipath code selects for
forwarding I/O. This RFC proposes adding three new sysfs attribute files
named "numa", "round_robin" and "queue_depth" under the namespace head block
node path /sys/block/<nvmeXnY>/multipath/. As the names suggest, the "numa"
attribute shows the multipath I/O information for the numa policy,
"round_robin" for the round-robin policy and "queue_depth" for the
queue-depth policy.

Please find below the output generated with this proposed RFC patch applied
on a system with two multi-controller PCIe NVMe disks attached. The system is
also an NVMf-TCP host connected to an NVMf-TCP target over two NICs. The
system had four NUMA nodes online when the output below was captured:

# cat /sys/devices/system/node/online 
0-3

# lscpu
<snip>
NUMA:
  NUMA node(s):           4
  NUMA node0 CPU(s):
  NUMA node1 CPU(s):      0-7
  NUMA node2 CPU(s):      8-31
  NUMA node3 CPU(s):      32-63
<snip>

Please note that NUMA node 0, though online, doesn't have any CPUs assigned to it.

# nvme list -v 
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys0     nvmet_subsystem                                                                                  nvme0, nvme3
nvme-subsys2     nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1                                                 nvme1, nvme2
nvme-subsys4     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme4, nvme5

Device           Cntlid SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces      
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0    1      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100        nvme-subsys0 nvme0n1
nvme1    2      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0524:28:00.0   U50EE.001.WZS000E-P3-C4-R1 nvme-subsys2 nvme2n2
nvme2    1      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0584:28:00.0   U50EE.001.WZS000E-P3-C4-R2 nvme-subsys2 nvme2n2
nvme3    2      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100        nvme-subsys0 nvme0n1
nvme4    65     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0   U50EE.001.WZS000E-P3-C14-R2 nvme-subsys4 nvme4n1
nvme5    66     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0   U50EE.001.WZS000E-P3-C14-R1 nvme-subsys4 nvme4n1

Device            Generic           NSID       Usage                      Format           Controllers     
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme0n1 /dev/ng0n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme0, nvme3
/dev/nvme2n2 /dev/ng2n2   0x2          0.00   B /   5.75  GB      4 KiB +  0 B   nvme1, nvme2
/dev/nvme4n1 /dev/ng4n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme4, nvme5

As we can see above, there are three shared namespaces, and for each shared
namespace a head disk node is created. In terms of iopolicy, we have
"queue-depth" configured for nvme-subsys0, "numa" for nvme-subsys2 and
"round-robin" for nvme-subsys4.

# cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
queue-depth
# cat /sys/class/nvme-subsystem/nvme-subsys2/iopolicy
numa
# cat /sys/class/nvme-subsystem/nvme-subsys4/iopolicy
round-robin
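
The iopolicy can also be switched at runtime by writing one of the supported
policy names to the same subsystem attribute, for example:

# echo round-robin > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy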

Now, under each namespace head disk node, we create the sysfs attributes
"numa", "round_robin" and "queue_depth" as shown below.

# tree /sys/block/nvme0n1/multipath/
/sys/block/nvme0n1/multipath/
├── numa
├── queue_depth
└── round_robin

0 directories, 3 files

# tree /sys/block/nvme2n2/multipath/
/sys/block/nvme2n2/multipath/
├── numa
├── queue_depth
└── round_robin

0 directories, 3 files

# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── numa
├── queue_depth
└── round_robin

0 directories, 3 files

The multipath I/O information for each of the above head disk nodes and its
respective iopolicy is then represented as below.

# cat /sys/block/nvme0n1/multipath/queue_depth 
nvme0c0n1 423
nvme0c3n1 425

In the above output, we print the multipath I/O information for the
queue-depth iopolicy, as the head disk node "nvme0n1" is defined under
nvme-subsys0 which has the queue-depth iopolicy configured. For the
queue-depth iopolicy it doesn't matter on which NUMA node the I/O workload is
running, so we print each I/O path once along with its current queue depth.
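
For illustration, a show callback for the "queue_depth" attribute could look
roughly like the sketch below. It assumes the per-controller nr_active
counter maintained by the queue-depth iopolicy and the drivers/nvme/host
internals from "nvme.h"; the actual patch may compute and format this
differently:

#include <linux/blkdev.h>
#include <linux/rculist.h>
#include <linux/sysfs.h>
#include "nvme.h"

static ssize_t queue_depth_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
{
        struct gendisk *disk = dev_to_disk(dev);
        /* The head disk's private_data is assumed to be the nvme_ns_head. */
        struct nvme_ns_head *head = disk->private_data;
        struct nvme_ns *ns;
        int len = 0;

        rcu_read_lock();
        /* One line per path: path name and its current queue depth. */
        list_for_each_entry_rcu(ns, &head->list, siblings)
                len += sysfs_emit_at(buf, len, "%s %d\n",
                                     ns->disk->disk_name,
                                     atomic_read(&ns->ctrl->nr_active));
        rcu_read_unlock();
        return len;
}
static DEVICE_ATTR_RO(queue_depth);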

# cat /sys/block/nvme2n2/multipath/numa 
node1: nvme2c1n2
node2: nvme2c2n2
node3: nvme2c2n2

In the above output, we print the multipath I/O information for the numa
iopolicy, as the head disk node "nvme2n2" is defined under nvme-subsys2 which
has the numa iopolicy configured. For the numa iopolicy, we print the
preferred I/O path for each NUMA node.
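
The preferred path per node follows the NUMA distance between that node and
the controller that owns the path. For PCIe controllers this can be
cross-checked against the PCI device's locality (address taken from the nvme
list output above):

# cat /sys/bus/pci/devices/0524:28:00.0/numa_node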

# cat /sys/block/nvme4n1/multipath/round_robin 
node1: nvme4c4n1 nvme4c5n1 
node2: nvme4c4n1 nvme4c5n1 
node3: nvme4c5n1 nvme4c4n1 

In the above output, we print the multipath I/O information for the
round-robin iopolicy, as the head disk node "nvme4n1" is defined under
nvme-subsys4 which has the round-robin iopolicy configured. For the
round-robin iopolicy, we print, for each NUMA node, the I/O paths in the
order they are selected in round-robin fashion.

Please note that we don't print ana_state information in the above output
because it can be retrieved from the existing per-path sysfs attribute. We
also don't print the I/O controller in the output because it can easily be
derived from the respective I/O path name; for instance, I/O path nvme4c5n1
is accessed through controller nvme5.
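
For example, where the transport reports ANA, the per-path state can be read
directly from the path block device:

# cat /sys/block/nvme0c0n1/ana_state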

IMO, once we have the above information exported through sysfs, we shall be
able to easily parse it in libnvme/nvme-cli and use it to further extend the
output of the nvme show-topology command or to add a new nvme-cli command.

[1] https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/

Changes since v1:
    - Use sysfs to export multipath I/O information instead of debugfs
	 
Nilay Shroff (1):
  nvme-multipath: Add sysfs attributes for showing multipath info

 drivers/nvme/host/multipath.c |   2 +-
 drivers/nvme/host/nvme.h      |   1 +
 drivers/nvme/host/sysfs.c     | 108 ++++++++++++++++++++++++++++++++++
 3 files changed, 110 insertions(+), 1 deletion(-)

-- 
2.45.2