[RFC PATCH v2 0/1] Add visibility for native NVMe multipath using sysfs
Nilay Shroff
nilay at linux.ibm.com
Fri Aug 9 10:29:56 PDT 2024
Hi,
This patch proposes adding new sysfs attributes to provide visibility into
native multipath I/O. The previous version of this RFC[1] proposed using
debugfs for this; however, the general feedback was to instead export the
multipath I/O information through sysfs attributes and then later parse and
format those attributes using libnvme/nvme-cli.
As we know, NVMe native multipath supports three different I/O policies
(numa, round-robin and queue-depth) for selecting the I/O path; however, we
currently have no visibility into which path the multipath code selects for
forwarding I/O. This RFC proposes adding three new sysfs attribute files,
named "numa", "round_robin" and "queue_depth", under the namespace head
block node path /sys/block/<nvmeXnY>/multipath/. As the names suggest, the
"numa" attribute file shows the multipath I/O information for the numa
policy, "round_robin" shows it for the round-robin policy, and
"queue_depth" shows it for the queue-depth policy.
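For illustration only, below is a minimal userspace C sketch of how a
consumer such as nvme-cli might read these attribute files for one head
disk node. It is not part of the patch and simply assumes the layout and
attribute names proposed above:

#include <stdio.h>

/*
 * Read and print the three proposed multipath attribute files for a
 * given head disk node (e.g. "nvme0n1"). Assumes a kernel with this
 * RFC applied, so /sys/block/<disk>/multipath/ exists.
 */
static void show_multipath_info(const char *disk)
{
	static const char * const attrs[] = {
		"numa", "round_robin", "queue_depth"
	};
	char path[256], line[256];

	for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
		snprintf(path, sizeof(path), "/sys/block/%s/multipath/%s",
			 disk, attrs[i]);
		FILE *f = fopen(path, "r");
		if (!f)
			continue;	/* attribute missing or unreadable */
		printf("%s:\n", attrs[i]);
		while (fgets(line, sizeof(line), f))
			printf("  %s", line);
		fclose(f);
	}
}

int main(int argc, char **argv)
{
	show_multipath_info(argc > 1 ? argv[1] : "nvme0n1");
	return 0;
}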
Please find below the output generated with this RFC patch applied on a
system with two multi-controller PCIe NVMe disks attached. The system is
also an NVMf-TCP host connected to an NVMf-TCP target over two NICs. Four
NUMA nodes were online when the output below was captured:
# cat /sys/devices/system/node/online
0-3
# lscpu
<snip>
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s):
NUMA node1 CPU(s): 0-7
NUMA node2 CPU(s): 8-31
NUMA node3 CPU(s): 32-63
<snip>
Please note that NUMA node 0, though online, doesn't have any CPUs assigned to it.
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys0 nvmet_subsystem nvme0, nvme3
nvme-subsys2 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme1, nvme2
nvme-subsys4 nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057 nvme4, nvme5
Device Cntlid SN MN FR TxPort Address Slot Subsystem Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 1 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 nvme-subsys0 nvme0n1
nvme1 2 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 U50EE.001.WZS000E-P3-C4-R1 nvme-subsys2 nvme2n2
nvme2 1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 U50EE.001.WZS000E-P3-C4-R2 nvme-subsys2 nvme2n2
nvme3 2 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 nvme-subsys0 nvme0n1
nvme4 65 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 058e:78:00.0 U50EE.001.WZS000E-P3-C14-R2 nvme-subsys4 nvme4n1
nvme5 66 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 052e:78:00.0 U50EE.001.WZS000E-P3-C14-R1 nvme-subsys4 nvme4n1
Device Generic NSID Usage Format Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme0n1 /dev/ng0n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme0, nvme3
/dev/nvme2n2 /dev/ng2n2 0x2 0.00 B / 5.75 GB 4 KiB + 0 B nvme1, nvme2
/dev/nvme4n1 /dev/ng4n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme4, nvme5
As seen above, three shared namespaces are created, and for each shared
namespace a head disk node is created. In terms of iopolicy, "queue-depth"
is configured for nvme-subsys0, "numa" for nvme-subsys2 and "round-robin"
for nvme-subsys4.
# cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
queue-depth
# cat /sys/class/nvme-subsystem/nvme-subsys2/iopolicy
numa
# cat /sys/class/nvme-subsystem/nvme-subsys4/iopolicy
round-robin
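The iopolicy attribute used above already exists upstream. For reference, a
small C sketch that dumps it for every subsystem on the host (using
glob(3) over /sys/class/nvme-subsystem) could look like this:

#include <glob.h>
#include <stdio.h>

/* Print the configured iopolicy of every NVMe subsystem on the host. */
int main(void)
{
	glob_t g;
	char line[64];

	if (glob("/sys/class/nvme-subsystem/nvme-subsys*/iopolicy", 0, NULL, &g))
		return 1;	/* no NVMe subsystems found */

	for (size_t i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "r");
		if (f && fgets(line, sizeof(line), f))
			printf("%s: %s", g.gl_pathv[i], line);
		if (f)
			fclose(f);
	}
	globfree(&g);
	return 0;
}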
Now, under each namespace head disk node, the sysfs attributes "numa",
"round_robin" and "queue_depth" are created as shown below.
# tree /sys/block/nvme0n1/multipath/
/sys/block/nvme0n1/multipath/
├── numa
├── queue_depth
└── round_robin
0 directories, 3 files
# tree /sys/block/nvme2n2/multipath/
/sys/block/nvme2n2/multipath/
├── numa
├── queue_depth
└── round_robin
0 directories, 3 files
# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── numa
├── queue_depth
└── round_robin
0 directories, 3 files
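A tool can detect support for the proposed interface simply by checking
for the multipath/ directory under each block node. A hedged C sketch,
again assuming the layout proposed in this RFC:

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * List every block node under /sys/block that exposes the proposed
 * multipath/ attribute directory, i.e. every NVMe head disk node on a
 * kernel with this RFC applied.
 */
int main(void)
{
	DIR *d = opendir("/sys/block");
	struct dirent *de;
	struct stat st;
	char path[512];

	if (!d)
		return 1;
	while ((de = readdir(d))) {
		snprintf(path, sizeof(path), "/sys/block/%s/multipath",
			 de->d_name);
		if (!stat(path, &st) && S_ISDIR(st.st_mode))
			printf("%s\n", de->d_name);
	}
	closedir(d);
	return 0;
}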
The multipath I/O information for each of the above head disk nodes and
its respective iopolicy is then represented as below.
# cat /sys/block/nvme0n1/multipath/queue_depth
nvme0c0n1 423
nvme0c3n1 425
In the above output, we print the multipath I/O information for the
queue-depth iopolicy because the head disk node "nvme0n1" is defined under
nvme-subsys0, which has the queue-depth iopolicy configured. For the
queue-depth iopolicy, it doesn't matter on which NUMA node the I/O workload
is running, so each I/O path is printed once along with its current queue
depth.
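Assuming exactly that "<path-name> <current-queue-depth>" per-line layout,
the attribute is trivial to consume; here is a hedged C sketch (the path
and format are assumptions based on the example above, not a stable ABI):

#include <stdio.h>

/*
 * Parse the proposed queue_depth attribute, assumed to contain one
 * "<path-name> <current-queue-depth>" pair per line, and report the
 * least busy path.
 */
int main(void)
{
	FILE *f = fopen("/sys/block/nvme0n1/multipath/queue_depth", "r");
	char path[64], best[64] = "";
	unsigned int depth, best_depth = ~0U;

	if (!f)
		return 1;
	while (fscanf(f, "%63s %u", path, &depth) == 2) {
		printf("path %s: queue depth %u\n", path, depth);
		if (depth < best_depth) {
			best_depth = depth;
			snprintf(best, sizeof(best), "%s", path);
		}
	}
	fclose(f);
	if (best[0])
		printf("least busy path: %s (%u)\n", best, best_depth);
	return 0;
}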
# cat /sys/block/nvme2n2/multipath/numa
node1: nvme2c1n2
node2: nvme2c2n2
node3: nvme2c2n2
In the above output, we print the multipath I/O information for the numa
iopolicy because the head disk node "nvme2n2" is defined under
nvme-subsys2, which has the numa iopolicy configured. For the numa
iopolicy, the preferred I/O path is printed for each NUMA node.
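Similarly, the numa attribute ("node<N>: <preferred path>" per line) could
be parsed along the following lines; again this is a sketch that assumes
the exact format shown above:

#include <stdio.h>

/*
 * Parse the proposed numa attribute, assumed to contain one
 * "node<N>: <preferred-path>" line per online NUMA node.
 */
int main(void)
{
	FILE *f = fopen("/sys/block/nvme2n2/multipath/numa", "r");
	char path[64];
	unsigned int node;

	if (!f)
		return 1;
	while (fscanf(f, " node%u: %63s", &node, path) == 2)
		printf("NUMA node %u prefers path %s\n", node, path);
	fclose(f);
	return 0;
}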
# cat /sys/block/nvme4n1/multipath/round_robin
node1: nvme4c4n1 nvme4c5n1
node2: nvme4c4n1 nvme4c5n1
node3: nvme4c5n1 nvme4c4n1
In the above output, we print the multipath I/O information for the
round-robin iopolicy because the head disk node "nvme4n1" is defined under
nvme-subsys4, which has the round-robin iopolicy configured. For the
round-robin iopolicy, we print the per-node I/O paths in round-robin order.
Please note that we don't print the ana_state in the above output because
it can be retrieved from an existing per-path sysfs attribute. We also
don't print the I/O controller because it can easily be derived from the
respective I/O path name; for instance, I/O path nvme4c5n1 is accessible
through controller nvme5.
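Deriving the controller from a path name is just a matter of picking out
the number after the 'c' in the nvme<subsys>c<ctrl>n<nsid> naming scheme.
A hypothetical helper (not part of the patch) could look like this:

#include <stdio.h>

/*
 * Derive the controller name from a multipath I/O path name of the form
 * nvme<subsys>c<ctrl>n<nsid>, e.g. nvme4c5n1 -> nvme5.
 * Returns 0 on success, -1 if the name doesn't match that form.
 */
static int path_to_ctrl(const char *path, char *ctrl, size_t len)
{
	unsigned int subsys, cntl, nsid;

	if (sscanf(path, "nvme%uc%un%u", &subsys, &cntl, &nsid) != 3)
		return -1;
	snprintf(ctrl, len, "nvme%u", cntl);
	return 0;
}

int main(void)
{
	char ctrl[32];

	if (!path_to_ctrl("nvme4c5n1", ctrl, sizeof(ctrl)))
		printf("nvme4c5n1 is served by controller %s\n", ctrl);
	return 0;
}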
IMO, once the above information is exported through sysfs, we should be
able to easily parse it in libnvme/nvme-cli and use it to further extend
the output of the nvme show-topology command or to add a new nvme-cli
command.
[1] https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/
Changes since v1:
- Use sysfs to export multipath I/O information instead of debugfs
Nilay Shroff (1):
nvme-multipath: Add sysfs attributes for showing multipath info
drivers/nvme/host/multipath.c | 2 +-
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/sysfs.c | 108 ++++++++++++++++++++++++++++++++++
3 files changed, 110 insertions(+), 1 deletion(-)
--
2.45.2