[PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs
Nilay Shroff
nilay at linux.ibm.com
Wed Oct 30 03:41:40 PDT 2024
Hi,
This RFC proposes adding new sysfs attributes to provide visibility into
NVMe native multipath I/O.
The changes are divided into three patches.
The first patch adds visibility for round-robin io-policy.
The second patch adds visibility for numa io-policy.
The third patch adds visibility for queue-depth io-policy.
As we know, NVMe native multipath supports three different I/O policies
(numa, round-robin and queue-depth) for selecting the I/O path. However, we
currently have no visibility into which path the multipath code selects
for forwarding I/O. This RFC adds that visibility through new sysfs
attribute files named "numa_nodes" and "queue_depth" under each
namespace path block device, /sys/block/nvmeXcYnZ/. We also create a
"multipath" sysfs directory under the head disk node and, from this
directory, add a link to each namespace path device that the head disk
node points to.
Please find below the output generated with this proposed RFC patch applied on
a system with two multi-controller PCIe NVMe disks attached to it. This
system is also an NVMf-TCP host which is connected to an NVMf-TCP target
over two NICs. The system had four numa nodes online when the below
output was captured:
# cat /sys/devices/system/node/online
0-3
# lscpu
<snip>
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s):
NUMA node1 CPU(s): 0-7
NUMA node2 CPU(s): 8-31
NUMA node3 CPU(s): 32-63
<snip>
Please note that numa node 0, though online, doesn't currently have any
CPUs assigned to it.
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1 nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057 nvme0, nvme1
nvme-subsys3 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme2, nvme3
nvme-subsys4 nvmet_subsystem nvme4, nvme5
Device Cntlid SN MN FR TxPort Address Slot Subsystem Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 66 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 052e:78:00.0 U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
nvme1 65 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 058e:78:00.0 U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
nvme2 2 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
nvme3 1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
nvme4 1 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 nvme-subsys4 nvme4n1
nvme5 2 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 nvme-subsys4 nvme4n1
Device Generic NSID Usage Format Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme0, nvme1
/dev/nvme3n1 /dev/ng3n1 0x2 0.00 B / 5.75 GB 4 KiB + 0 B nvme2, nvme3
/dev/nvme4n1 /dev/ng4n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme4, nvme5
# nvme show-topology
nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=numa
\
+- ns 1
\
+- nvme0 pcie 052e:78:00.0 live optimized
+- nvme1 pcie 058e:78:00.0 live optimized
nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=round-robin
\
+- ns 2
\
+- nvme2 pcie 0524:28:00.0 live optimized
+- nvme3 pcie 0584:28:00.0 live optimized
nvme-subsys4 - NQN=nvmet_subsystem
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=queue-depth
\
+- ns 1
\
+- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
+- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized
As we can see above, three shared namespaces have been created. In terms of
iopolicy, we have "numa" configured for nvme-subsys1, "round-robin"
configured for nvme-subsys3 and "queue-depth" configured for nvme-subsys4.
Now, under each namespace "head disk node", we create a sysfs attribute
group named "multipath". The "multipath" group then links to each path
that this head disk node points to:
# tree /sys/block/nvme1n1/multipath/
/sys/block/nvme1n1/multipath/
├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
└── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1
# tree /sys/block/nvme3n1/multipath/
/sys/block/nvme3n1/multipath/
├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
└── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1
# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
└── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1
One can easily infer from the above output that for the "round-robin"
I/O policy, configured under nvme-subsys3, the I/O workload targeted at
nvme3n1 would alternate between nvme3c2n1 and nvme3c3n1, assuming the ANA
state of each path is optimized (as can be seen in the output of
show-topology).
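As a rough illustration of how a userspace tool might consume the
"multipath" directory, the Python sketch below enumerates the path links
under a head disk node and then models round-robin selection as plain
alternation. The helper names (list_multipath_paths, round_robin) and the
sysfs_block parameter are made up for this example and are not part of the
patches. The alternation is a simplification: it ignores ANA state changes
and path failures, which the kernel does take into account.

```python
from itertools import cycle, islice
from pathlib import Path


def list_multipath_paths(head, sysfs_block="/sys/block"):
    """Map each link name under <head>/multipath to its resolved target.

    `head` is a head disk node name such as "nvme3n1". `sysfs_block` is a
    hypothetical parameter so the helper can be pointed at a test tree.
    """
    mp_dir = Path(sysfs_block) / head / "multipath"
    return {
        entry.name: entry.resolve()
        for entry in sorted(mp_dir.iterdir())
        if entry.is_symlink()
    }


def round_robin(paths, n_ios):
    """Return the path used for each of n_ios requests, alternating in order.

    Simplified model only: every path is assumed to stay live and optimized.
    """
    return list(islice(cycle(paths), n_ios))
```

On the system shown above, list_multipath_paths("nvme3n1") should return
the two names nvme3c2n1 and nvme3c3n1, and feeding those names to
round_robin() gives the alternating sequence described in the text.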
For numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
attribute file shows the numa nodes being preferred by the respective
namespace path. The numa nodes value is a comma-delimited list of nodes or
A-B ranges of nodes.
# cat /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes
0-1
# cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3
From the above output, one can easily infer that an I/O workload targeted at
nvme1n1 and running on numa nodes 0 and 1 would use path nvme1c0n1.
Similarly, an I/O workload running on numa nodes 2 and 3 would use path
nvme1c1n1.
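For anyone scripting against "numa_nodes", the value can be expanded into
individual node numbers with a few lines of Python. This is only a sketch
of the format described above (comma-delimited nodes and A-B ranges); the
function names parse_node_list and path_for_node are hypothetical and the
input dictionary simply mirrors the two cat outputs shown.

```python
def parse_node_list(text):
    """Expand a node list such as "0-1" or "0,2-3" into a list of ints.

    An empty string yields an empty list.
    """
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            nodes.extend(range(int(first), int(last) + 1))
        else:
            nodes.append(int(part))
    return nodes


def path_for_node(node, numa_nodes_by_path):
    """Return the first path whose numa_nodes value contains `node`."""
    for path, text in numa_nodes_by_path.items():
        if node in parse_node_list(text):
            return path
    return None


# Values copied from the numa_nodes output above.
observed = {"nvme1c0n1": "0-1", "nvme1c1n1": "2-3"}
print(path_for_node(2, observed))  # nvme1c1n1
```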
For queue-depth I/O policy, configured under nvme-subsys4, the "queue_depth"
attribute file shows the number of active/in-flight I/O requests currently
queued for each path.
# cat /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth
518
# cat /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
504
From the above output, one can easily infer that an I/O workload targeted at
nvme4n1 uses two paths, nvme4c4n1 and nvme4c5n1, and that the current queue
depths of these paths are 518 and 504, respectively.
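To relate these numbers to path selection, here is a small Python sketch.
It assumes that the queue-depth policy favours the path with the fewest
in-flight requests; the function name least_loaded_path is hypothetical and
the depths are the two values read above. Since "queue_depth" is a
point-in-time reading, the result only describes that instant.

```python
def least_loaded_path(queue_depths):
    """Return the path name with the smallest queue depth, or None if empty."""
    if not queue_depths:
        return None
    return min(queue_depths, key=queue_depths.get)


# Values copied from the queue_depth output above.
snapshot = {"nvme4c4n1": 518, "nvme4c5n1": 504}
print(least_loaded_path(snapshot))  # nvme4c5n1
```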
Changes since v4:
- Ensure that we create sysfs link from head gendisk node to each path
device irrespective of the ANA state of the path (Hannes Reinecke)
- Split the patch into a three-patch series and add commentary in the
code so that it's easy to read and understand the core logic (Sagi
Grimberg)
- Don't show any output if user reads "numa_nodes" file and configured
iopolicy is anything but numa; similarly don't emit any output if user
reads "queue_depth" file and configured iopolicy is anything but
queue-depth (Sagi Grimberg)
Changes since v3:
- Protect the namespace dereference code with srcu read lock (Daniel Wagner)
Changes since v2:
- Use one value per one sysfs attribute (Keith Busch)
Changes since v1:
- Use sysfs to export multipath I/O information instead of debugfs
Nilay Shroff (3):
nvme-multipath: Add visibility for round-robin io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for queue-depth io-policy
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/multipath.c | 120 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 20 ++++--
drivers/nvme/host/sysfs.c | 20 ++++++
4 files changed, 159 insertions(+), 4 deletions(-)
--
2.45.2