[PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs

Nilay Shroff <nilay@linux.ibm.com>
Tue Sep 10 23:26:40 PDT 2024


Hi,

This patch proposes adding new sysfs attributes to provide visibility into
native multipath I/O.

The first version of this RFC[1] proposed using debugfs for visibility;
however, the general feedback was to instead export the multipath I/O
information using sysfs attributes and then later parse and format those
sysfs attributes using libnvme/nvme-cli.

The second version of this RFC[2] used sysfs, however the sysfs attribute
file contained multiple lines of output and the feedback was to instead
follow the principle of one value per attribute.

The third version of this RFC[3] follows the one-value-per-attribute
principle. There was a review comment asking to take the srcu read lock
while dereferencing the per-node namespace (current path) pointer, as that
pointer is protected by srcu.

The fourth version of this RFC therefore ensures that the namespace
dereference code is protected by the srcu read lock.
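For reference, below is a minimal sketch of the srcu read-side pattern this
refers to. The field names head->srcu and head->current_path[] follow the
existing nvme-multipath code; the surrounding snippet is illustrative only,
not a quote of the patch:

	/* kernel context: linux/srcu.h, linux/nodemask.h, drivers/nvme/host/nvme.h */
	int node, srcu_idx;
	struct nvme_ns *ns;

	srcu_idx = srcu_read_lock(&head->srcu);
	for_each_node(node) {
		/* current_path[] is an srcu-protected pointer array */
		ns = srcu_dereference(head->current_path[node], &head->srcu);
		if (ns) {
			/* ... inspect the path namespace for this node ... */
		}
	}
	srcu_read_unlock(&head->srcu, srcu_idx);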

As we know, NVMe native multipath supports three different I/O policies
(numa, round-robin and queue-depth) for selecting the I/O path; however, we
don't have any visibility into which path the multipath code selects for
forwarding I/O. This RFC helps add that visibility through new sysfs
attribute files named "numa_nodes" and "queue_depth" under each namespace
block device path /sys/block/nvmeXcYnZ/. We also create a "multipath"
sysfs directory under the head disk node and, from this directory, add a
link to each namespace path device the head disk node points to.
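As a rough illustration of the second part, a link from the head disk's
"multipath" group to a path device could be created with
sysfs_add_link_to_group(). The helper below is a hypothetical sketch, not
the exact patch code:

	/* kernel context: linux/sysfs.h, linux/blkdev.h */

	/*
	 * Hypothetical helper: link a path device (nvmeXcYnZ) into the
	 * "multipath" sysfs group of its head disk node (nvmeXnZ).
	 */
	static int nvme_add_ns_head_link(struct nvme_ns_head *head,
					 struct nvme_ns *ns)
	{
		return sysfs_add_link_to_group(&disk_to_dev(head->disk)->kobj,
					       "multipath",
					       &disk_to_dev(ns->disk)->kobj,
					       ns->disk->disk_name);
	}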

Please find below the output generated with this proposed RFC patch applied
on a system with two multi-controller PCIe NVMe disks attached to it. This
system is also an NVMf-TCP host which is connected to an NVMf-TCP target
over two NICs. The system had four NUMA nodes online when the output below
was captured:

# cat /sys/devices/system/node/online 
0-3

# lscpu
<snip>
NUMA:
  NUMA node(s):           4
  NUMA node0 CPU(s):
  NUMA node1 CPU(s):      0-7
  NUMA node2 CPU(s):      8-31
  NUMA node3 CPU(s):      32-63
<snip>

Please note that NUMA node 0, though online, doesn't have any CPUs
currently assigned to it.

# nvme list -v 
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme0, nvme1
nvme-subsys3     nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1                                                 nvme2, nvme3
nvme-subsys4     nvmet_subsystem                                                                                  nvme4, nvme5

Device           Cntlid SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0    66     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0   U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
nvme1    65     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0   U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
nvme2    2      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0524:28:00.0   U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
nvme3    1      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0584:28:00.0   U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
nvme4    1      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100        nvme-subsys4 nvme4n1
nvme5    2      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100        nvme-subsys4 nvme4n1

Device            Generic           NSID       Usage                      Format           Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme0, nvme1
/dev/nvme3n1 /dev/ng3n1   0x2          0.00   B /   5.75  GB      4 KiB +  0 B   nvme2, nvme3
/dev/nvme4n1 /dev/ng4n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme4, nvme5


# nvme show-topology
nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- ns 1
 \
  +- nvme0 pcie 052e:78:00.0 live optimized
  +- nvme1 pcie 058e:78:00.0 live optimized

nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=round-robin
\
 +- ns 2
 \
  +- nvme2 pcie 0524:28:00.0 live optimized
  +- nvme3 pcie 0584:28:00.0 live optimized

nvme-subsys4 - NQN=nvmet_subsystem
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=queue-depth
\
 +- ns 1
 \
  +- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
  +- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized

As we can see above, we have three shared namespaces created. In terms of
iopolicy, we have "numa" configured for nvme-subsys1, "round-robin"
configured for nvme-subsys3 and "queue-depth" configured for nvme-subsys4.

Now, under each namespace "head disk node", we create a sysfs attribute
group named "multipath". The "multipath" group then contains a link to
each path the head disk node points to:

# tree /sys/block/nvme1n1/multipath/
/sys/block/nvme1n1/multipath/
├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
└── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1

# tree /sys/block/nvme3n1/multipath/
/sys/block/nvme3n1/multipath/
├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
└── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1

# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
└── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1

One can easily infer from the above output that, for the "round-robin"
I/O policy configured under nvme-subsys3, an I/O workload targeted at
nvme3n1 would alternate between nvme3c2n1 and nvme3c3n1, assuming the ANA
state of each path is optimized (as can be seen in the output of
show-topology).

For the numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
attribute file shows the NUMA nodes preferred by the respective namespace
path. The numa_nodes value is a comma-delimited list of nodes or an A-B
range of nodes.

# cat  /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes 
0-1

# cat  /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3

From the above output, one can easily infer that an I/O workload targeted
at nvme1n1 and running on NUMA nodes 0 and 1 would use path nvme1c0n1.
Similarly, an I/O workload running on NUMA nodes 2 and 3 would use path
nvme1c1n1.
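To make the semantics concrete, a numa_nodes show() callback could be built
along the lines below, reusing the srcu read-side pattern shown earlier.
The function signature and the way the nvme_ns pointer is passed in are
illustrative assumptions, not the exact patch code:

	/* kernel context: linux/sysfs.h, linux/srcu.h, linux/nodemask.h */

	/*
	 * Sketch: report the NUMA nodes whose current path resolves to this
	 * namespace path device ("ns"). Emits a node list such as "0-1".
	 */
	static ssize_t numa_nodes_show(struct nvme_ns *ns, char *buf)
	{
		struct nvme_ns_head *head = ns->head;
		nodemask_t numa_nodes;
		int node, srcu_idx;

		nodes_clear(numa_nodes);

		srcu_idx = srcu_read_lock(&head->srcu);
		for_each_node(node) {
			/* mark nodes whose preferred path is this namespace */
			if (ns == srcu_dereference(head->current_path[node],
						   &head->srcu))
				node_set(node, numa_nodes);
		}
		srcu_read_unlock(&head->srcu, srcu_idx);

		/* one value per attribute: a node list, e.g. "0-1" or "2-3" */
		return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
	}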

For queue-depth I/O policy, configured under nvme-subsys4, the "queue_depth" 
attribute file shows the number of active/in-flight I/O requests currently 
queued for each path.

# cat  /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth 
518

# cat  /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
504

From the above output, one can easily infer that an I/O workload targeted
at nvme4n1 uses the two paths nvme4c4n1 and nvme4c5n1, and that the current
queue depth of each path is 518 and 504 respectively.
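Such a value could be produced by a show() callback that simply reports a
per-path in-flight counter. The sketch below assumes the queue-depth
iopolicy keeps an atomic counter named nr_active on the controller; both
the counter name and the function signature are assumptions for
illustration, not a quote of the patch:

	/* kernel context: linux/sysfs.h, linux/atomic.h */

	/*
	 * Sketch: report the number of in-flight requests on this path,
	 * assuming an atomic_t "nr_active" counter per controller.
	 */
	static ssize_t queue_depth_show(struct nvme_ns *ns, char *buf)
	{
		return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
	}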

[1] https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/
[2] https://lore.kernel.org/all/20240809173030.2281021-2-nilay@linux.ibm.com/
[3] https://lore.kernel.org/all/20240903135228.283820-1-nilay@linux.ibm.com/

Changes since v3:
    - Protect the namespace dereference code with srcu read lock (Daniel Wagner)

Changes since v2:
    - Use one value per one sysfs attribute (Keith Busch)

Changes since v1:
    - Use sysfs to export multipath I/O information instead of debugfs

Nilay Shroff (1):
  nvme-multipath: Add sysfs attributes for showing multipath info

 drivers/nvme/host/core.c      |  3 ++
 drivers/nvme/host/multipath.c | 69 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      | 20 ++++++++--
 drivers/nvme/host/sysfs.c     | 20 ++++++++++
 4 files changed, 108 insertions(+), 4 deletions(-)

-- 
2.45.2



