[PATCH RFC 0/1] Add visibility for native NVMe multipath using debugfs
Nilay Shroff
nilay at linux.ibm.com
Mon Jul 22 02:31:08 PDT 2024
Hi,
This patch proposes adding a new debugfs file entry for NVMe native
multipath. NVMe native multipath today supports three different
io-policies (numa, round-robin and queue-depth) for selecting the
optimal I/O path and forwarding data. However, we don't yet have any
visibility into which I/O path the NVMe native multipath code actually
selects.

IMO, it'd be nice to have this information available under debugfs, as
it could help a user validate that the I/O path being chosen is optimal
for a given io-policy. This patch proposes adding a debugfs file for
each head disk node on the system: a file named "multipath" created
under "/sys/kernel/debug/block/nvmeXnY/".
Please find below the output generated with this patch applied on a
system with a multi-controller PCIe NVMe disk attached to it. The
system is also an NVMf-TCP host connected to an NVMf-TCP target over
two NIC cards. The system had two NUMA nodes online when the output
below was captured:
# cat /sys/devices/system/node/online
2-3
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1 nvmet_subsystem nvme1, nvme3
nvme-subsys2 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme0, nvme2
Device Cntlid SN MN FR TxPort Address Slot Subsystem Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 2 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 U50EE.001.WZS000E-P3-C4-R1 nvme-subsys2 nvme2n2
nvme2 1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 U50EE.001.WZS000E-P3-C4-R2 nvme-subsys2 nvme2n2
nvme1 1 a224673364d1dcb6fab9 Linux 6.9.0 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 nvme-subsys1 nvme1n1
nvme3 2 a224673364d1dcb6fab9 Linux 6.9.0 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 nvme-subsys1 nvme1n1
Device Generic NSID Usage Format Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme1, nvme3
/dev/nvme2n2 /dev/ng2n2 0x2 0.00 B / 5.75 GB 4 KiB + 0 B nvme0, nvme2
# cat /sys/class/nvme-subsystem/nvme-subsys2/iopolicy
numa
# cat /sys/kernel/debug/block/nvme2n2/multipath
io-policy: numa
io-path:
--------
node current-path ctrl ana-state
2 nvme2c2n2 nvme2 optimized
3 nvme2c0n2 nvme0 optimized
The above output shows that the currently selected iopolicy is numa.
When a workload runs I/O on NUMA node 2 and accesses namespace
"nvme2n2", it uses path nvme2c2n2 and controller nvme2 for forwarding
data, and the current ana-state of this path is optimized. Similarly,
an I/O workload running on NUMA node 3 would use path nvme2c0n2 and
controller nvme0.
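
For the numa policy, the per-node rows above could in principle be
produced by walking the online NUMA nodes and printing each node's
cached path, roughly as sketched below. This assumes the existing
head->current_path[] array and the nvme_ana_state_names[] table in
multipath.c; the helper name itself is hypothetical:

static void nvme_ns_head_show_numa_paths(struct nvme_ns_head *head,
                                         struct seq_file *m)
{
        int node, srcu_idx;
        struct nvme_ns *ns;

        seq_puts(m, "node  current-path  ctrl    ana-state\n");
        srcu_idx = srcu_read_lock(&head->srcu);
        for_each_online_node(node) {
                /* path last chosen by nvme_find_path() for this node */
                ns = srcu_dereference(head->current_path[node], &head->srcu);
                if (ns)
                        seq_printf(m, "%-5d %-13s %-7s %s\n", node,
                                   ns->disk->disk_name,
                                   dev_name(ns->ctrl->device),
                                   nvme_ana_state_names[ns->ana_state]);
        }
        srcu_read_unlock(&head->srcu, srcu_idx);
}
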
Now, changing the iopolicy to round-robin:
# echo "round-robin" > /sys/class/nvme-subsystem/nvme-subsys2/iopolicy
# cat /sys/kernel/debug/block/nvme2n2/multipath
io-policy: round-robin
io-path:
--------
node rr-path ctrl ana-state
2 nvme2c2n2 nvme2 optimized
2 nvme2c0n2 nvme0 optimized
3 nvme2c2n2 nvme2 optimized
3 nvme2c0n2 nvme0 optimized
The above output shows that the currently selected iopolicy is
round-robin. When an I/O workload runs on NUMA node 2 and accesses
namespace "nvme2n2", the I/O path toggles between nvme2c2n2/nvme2 and
nvme2c0n2/nvme0. The same is true for an I/O workload running on node
3. Both I/O paths are currently optimized.
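
For round-robin, a per-node listing would enumerate every usable
sibling path for each node rather than a single cached one. A sketch
under the same assumptions (nvme_path_is_disabled() and
nvme_ana_state_names[] as already found in multipath.c; the helper
name is hypothetical) could look like:

static void nvme_ns_head_show_rr_paths(struct nvme_ns_head *head,
                                       struct seq_file *m)
{
        int node, srcu_idx;
        struct nvme_ns *ns;

        seq_puts(m, "node  rr-path       ctrl    ana-state\n");
        srcu_idx = srcu_read_lock(&head->srcu);
        for_each_online_node(node) {
                /* every non-disabled sibling path can serve this node */
                list_for_each_entry_srcu(ns, &head->list, siblings,
                                srcu_read_lock_held(&head->srcu)) {
                        if (nvme_path_is_disabled(ns))
                                continue;
                        seq_printf(m, "%-5d %-13s %-7s %s\n", node,
                                   ns->disk->disk_name,
                                   dev_name(ns->ctrl->device),
                                   nvme_ana_state_names[ns->ana_state]);
                }
        }
        srcu_read_unlock(&head->srcu, srcu_idx);
}
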
The namespace "nvme1n1" is accessible over fabric(NVMf-TCP).
# cat /sys/kernel/debug/block/nvme1n1/multipath
io-policy: queue-depth
io-path:
--------
node path ctrl qdepth ana-state
2 nvme1c1n1 nvme1 1328 optimized
2 nvme1c3n1 nvme3 1324 optimized
3 nvme1c1n1 nvme1 1328 optimized
3 nvme1c3n1 nvme3 1324 optimized
The above output was captured while I/O was running against namespace
nvme1n1. It shows that the iopolicy is set to "queue-depth". For an I/O
workload running on NUMA node 2 and accessing namespace "nvme1n1", the
I/O path nvme1c1n1/nvme1 has a queue depth of 1328 and the other I/O
path nvme1c3n1/nvme3 has a queue depth of 1324. Both paths are
optimized, and it appears that both are utilized roughly equally for
forwarding I/O. The same can be said for a workload running on NUMA
node 3.
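
For the queue-depth policy, the qdepth column above could be derived
from the per-controller count of outstanding multipath requests that
the queue-depth iopolicy already maintains. A minimal fragment,
assuming that counter is ctrl->nr_active (an assumption about the
current implementation) and called per path from a loop like the
round-robin sketch above, would be:

static void nvme_show_qd_path(struct seq_file *m, int node,
                              struct nvme_ns *ns)
{
        /*
         * qdepth: outstanding requests on this path's controller; assumes
         * the nr_active counter maintained by the queue-depth iopolicy.
         */
        seq_printf(m, "%-5d %-13s %-7s %-7d %s\n", node,
                   ns->disk->disk_name, dev_name(ns->ctrl->device),
                   atomic_read(&ns->ctrl->nr_active),
                   nvme_ana_state_names[ns->ana_state]);
}
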
Nilay Shroff (1):
nvme-multipath: Add debugfs entry for showing multipath info
drivers/nvme/host/multipath.c | 92 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
2 files changed, 93 insertions(+)
--
2.45.2