[PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs
Nilay Shroff
nilay at linux.ibm.com
Tue Sep 10 23:26:40 PDT 2024
Hi,
This patch proposes adding new sysfs attributes to provide visibility into
native multipath I/O.
The first version of this RFC[1] proposed using debugfs for this visibility;
however, the general feedback was to instead export the multipath I/O
information through sysfs attributes and then later parse and format those
sysfs attributes using libnvme/nvme-cli.
The second version of this RFC[2] used sysfs, but the sysfs attribute
file contained multiple lines of output and the feedback was to instead
follow the principle of one value per attribute.
The third version of this RFC[3] follows the one-value-per-attribute
principle. A review comment suggested taking the srcu read lock while
dereferencing the namespace for each node, since that dereference is
protected by srcu.
The fourth version of this RFC therefore ensures that the namespace
dereference code is protected by the srcu read lock.
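For reference, the locking pattern in v4 looks roughly like the helper
below when checking a per-node current path (head->srcu and
head->current_path[] are existing nvme-multipath fields; the helper name
itself is hypothetical and the actual attribute code is sketched further
down):

/* Hypothetical helper: is @ns the current path for @node? */
static bool nvme_path_is_current(struct nvme_ns *ns, int node)
{
        struct nvme_ns_head *head = ns->head;
        struct nvme_ns *cur;
        bool is_current;
        int srcu_idx;

        /* Enter the SRCU read-side critical section before dereferencing. */
        srcu_idx = srcu_read_lock(&head->srcu);
        cur = srcu_dereference(head->current_path[node], &head->srcu);
        is_current = (cur == ns);
        srcu_read_unlock(&head->srcu, srcu_idx);

        return is_current;
}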
As we know, NVMe native multipath supports three different I/O policies
(numa, round-robin and queue-depth) for selecting the I/O path; however, we
currently have no visibility into which path the multipath code selects for
forwarding I/O. This RFC adds that visibility through new sysfs attribute
files named "numa_nodes" and "queue_depth" under each namespace block
device path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs
directory under the head disk node and, from this directory, add a link to
each namespace path device this head disk node points to.
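Sketched very roughly (the struct and variable names below are
illustrative, not necessarily those used in the patch), the new pieces are
two read-only device attributes on each path device plus a named,
initially empty attribute group on the head disk node that later receives
one symlink per path:

/* Per-path attributes exposed under /sys/block/nvmeXcYnZ/ */
static DEVICE_ATTR_RO(numa_nodes);   /* numa_nodes_show() sketched below */
static DEVICE_ATTR_RO(queue_depth);  /* queue_depth_show() sketched below */

/* Named group on the head disk node, i.e. /sys/block/nvmeXnZ/multipath/ */
static const struct attribute_group nvme_mpath_head_group = {
        .name = "multipath",
};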
Please find below the output generated with this proposed RFC patch applied
on a system with two multi-controller PCIe NVMe disks attached. The system
is also an NVMf-TCP host connected to an NVMf-TCP target over two NICs, and
it had four NUMA nodes online when the output below was captured:
# cat /sys/devices/system/node/online
0-3
# lscpu
<snip>
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s):
NUMA node1 CPU(s): 0-7
NUMA node2 CPU(s): 8-31
NUMA node3 CPU(s): 32-63
<snip>
Please note that NUMA node 0, though online, doesn't have any CPU
currently assigned to it.
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1 nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057 nvme0, nvme1
nvme-subsys3 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme2, nvme3
nvme-subsys4 nvmet_subsystem nvme4, nvme5
Device Cntlid SN MN FR TxPort Address Slot Subsystem Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 66 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 052e:78:00.0 U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
nvme1 65 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 058e:78:00.0 U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
nvme2 2 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
nvme3 1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
nvme4 1 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 nvme-subsys4 nvme4n1
nvme5 2 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 nvme-subsys4 nvme4n1
Device Generic NSID Usage Format Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme0, nvme1
/dev/nvme3n1 /dev/ng3n1 0x2 0.00 B / 5.75 GB 4 KiB + 0 B nvme2, nvme3
/dev/nvme4n1 /dev/ng4n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme4, nvme5
# nvme show-topology
nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=numa
\
+- ns 1
\
+- nvme0 pcie 052e:78:00.0 live optimized
+- nvme1 pcie 058e:78:00.0 live optimized
nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=round-robin
\
+- ns 2
\
+- nvme2 pcie 0524:28:00.0 live optimized
+- nvme3 pcie 0584:28:00.0 live optimized
nvme-subsys4 - NQN=nvmet_subsystem
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=queue-depth
\
+- ns 1
\
+- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
+- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized
As can be seen above, we have three shared namespaces. In terms of
iopolicy, "numa" is configured for nvme-subsys1, "round-robin" for
nvme-subsys3 and "queue-depth" for nvme-subsys4.
Now, under each namespace head disk node, we create a sysfs attribute
group named "multipath". The "multipath" group then links to each path
this head disk node points to:
# tree /sys/block/nvme1n1/multipath/
/sys/block/nvme1n1/multipath/
├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
└── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1
# tree /sys/block/nvme3n1/multipath/
/sys/block/nvme3n1/multipath/
├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
└── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1
# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
└── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1
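The per-path symlinks shown above could be created with
sysfs_add_link_to_group() when a path is bound to its head disk node; a
minimal sketch follows (the helper name and the exact devices used are
assumptions, while sysfs_add_link_to_group(), disk_to_dev() and dev_name()
are existing kernel APIs):

/* Hypothetical helper: link a path into the head's "multipath" group. */
static int nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
{
        struct device *head_dev = disk_to_dev(ns->head->disk);
        struct device *path_dev = disk_to_dev(ns->disk);

        /* Creates /sys/block/nvmeXnZ/multipath/nvmeXcYnZ -> path device */
        return sysfs_add_link_to_group(&head_dev->kobj, "multipath",
                                       &path_dev->kobj, dev_name(path_dev));
}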
One can easily infer from the tree output above that for the "round-robin"
I/O policy, configured under nvme-subsys3, an I/O workload targeted at
nvme3n1 alternates between nvme3c2n1 and nvme3c3n1, assuming the ANA state
of each path is optimized (as can be seen in the show-topology output).
For the numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
attribute file shows the NUMA nodes for which the respective namespace
path is the preferred path. The numa_nodes value is a comma-delimited list
of nodes or an A-B range of nodes.
# cat /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes
0-1
# cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3
From the above output, one can easily infer that an I/O workload targeted
at nvme1n1 and running on NUMA nodes 0 and 1 would use path nvme1c0n1.
Similarly, an I/O workload running on NUMA nodes 2 and 3 would use path
nvme1c1n1.
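For completeness, the "numa_nodes" attribute could be produced along the
following lines, taking the srcu read lock once around the whole loop (a
sketch only; nvme_get_ns_from_dev(), for_each_node(), the nodemask helpers
and the "%*pbl" format are existing kernel facilities, the rest is
assumed):

static ssize_t numa_nodes_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
{
        struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
        struct nvme_ns_head *head = ns->head;
        struct nvme_ns *cur;
        nodemask_t numa_nodes;
        int node, srcu_idx;

        nodes_clear(numa_nodes);

        /* Record every node for which this path is the current path. */
        srcu_idx = srcu_read_lock(&head->srcu);
        for_each_node(node) {
                cur = srcu_dereference(head->current_path[node],
                                       &head->srcu);
                if (cur == ns)
                        node_set(node, numa_nodes);
        }
        srcu_read_unlock(&head->srcu, srcu_idx);

        /* Print as a range list, e.g. "0-1" or "0,2-3". */
        return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
}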
For the queue-depth I/O policy, configured under nvme-subsys4, the
"queue_depth" attribute file shows the number of active/in-flight I/O
requests currently queued on each path.
# cat /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth
518
# cat /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
504
From the above output, one can easily infer that an I/O workload targeted
at nvme4n1 uses the two paths nvme4c4n1 and nvme4c5n1, and that their
current queue depths are 518 and 504 respectively.
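The "queue_depth" attribute can lean on the per-controller count of
in-flight requests that the queue-depth iopolicy already maintains; a
minimal sketch, assuming the nr_active counter and the NVME_IOPOLICY_QD
value introduced with that policy:

static ssize_t queue_depth_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
{
        struct nvme_ns *ns = nvme_get_ns_from_dev(dev);

        /* Only meaningful when the subsystem uses the queue-depth policy. */
        if (READ_ONCE(ns->head->subsys->iopolicy) != NVME_IOPOLICY_QD)
                return 0;

        /* Number of I/O requests currently in flight on this path. */
        return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
}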
[1] https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/
[2] https://lore.kernel.org/all/20240809173030.2281021-2-nilay@linux.ibm.com/
[3] https://lore.kernel.org/all/20240903135228.283820-1-nilay@linux.ibm.com/
Changes since v3:
- Protect the namespace dereference code with srcu read lock (Daniel Wagner)
Changes since v2:
- Use one value per one sysfs attribute (Keith Busch)
Changes since v1:
- Use sysfs to export multipath I/O information instead of debugfs
Nilay Shroff (1):
nvme-multipath: Add sysfs attributes for showing multipath info
drivers/nvme/host/core.c | 3 ++
drivers/nvme/host/multipath.c | 69 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 20 ++++++++--
drivers/nvme/host/sysfs.c | 20 ++++++++++
4 files changed, 108 insertions(+), 4 deletions(-)
--
2.45.2