[RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Sun Sep 21 04:12:20 PDT 2025
Hi,
This series introduces a new adaptive I/O policy for NVMe native
multipath. The existing policies (numa, round-robin, and queue-depth)
are static and do not adapt to real-time transport performance. The
numa policy selects the path closest to the NUMA node of the current
CPU, optimizing memory and path locality, but ignores actual path
performance. The round-robin policy distributes I/O evenly across all
paths, providing fairness but no performance awareness. The
queue-depth policy reacts to instantaneous queue occupancy, avoiding
heavily loaded paths, but does not account for actual latency,
throughput, or link speed.
The new adaptive policy addresses these gaps by selecting paths dynamically
based on measured I/O latency and, for fabrics, the negotiated link
speed. Latency is derived by passively sampling I/O completions. Link
speed is queried from the adapter and factored into path scoring. Each
path is assigned a weight proportional to its score, and I/Os are then
forwarded accordingly. As conditions change (e.g. latency spikes,
bandwidth differences), path weights are updated, automatically
steering traffic toward better-performing paths.
Early results show reduced tail latency under mixed workloads and
improved throughput by exploiting higher-speed links more effectively.
For example, with NVMf/TCP using two paths (one throttled with ~30 ms
delay), fio results with random read/write/rw workloads (direct I/O)
showed:
                numa           round-robin    queue-depth    adaptive
                -----------    -----------    -----------    -----------
READ:           50.0 MiB/s     105 MiB/s      230 MiB/s      350 MiB/s
WRITE:          65.9 MiB/s     125 MiB/s      385 MiB/s      446 MiB/s
RW:             R:30.6 MiB/s   R:56.5 MiB/s   R:122 MiB/s    R:175 MiB/s
                W:30.7 MiB/s   W:56.5 MiB/s   W:122 MiB/s    W:175 MiB/s
This patchset includes a total of 5 patches:
[PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting()
- Make blk_stat APIs available to block drivers.
- Needed for per-path latency measurement in adaptive policy.
[PATCH 2/5] nvme-multipath: add adaptive I/O policy
- Implement path scoring based on latency (EWMA).
- Distribute I/O proportionally to per-path weights.
[PATCH 3/5] nvme-multipath: add sysfs attribute for adaptive policy
- Introduce "adp_stat" under nvme path block device.
- Provide observability of latency, weight, and selection stats.
[PATCH 4/5] nvme-tcp: export NIC link speed
- Retrieve negotiated link speed (Mbps) from the adapter.
- Expose via sysfs for visibility/debugging.
[PATCH 5/5] nvme-multipath: factor link speed into path scoring
- Adjust adaptive path weights using link speed as a multiplier.
- Favor higher bandwidth links while still considering latency.
Currently, link speed reporting is implemented only for TCP NICs.
Support for Fibre Channel adapters will follow in a future patch.
As usual, feedback and suggestions are most welcome!
Thanks!
Nilay Shroff (5):
block: expose blk_stat_{enable,disable}_accounting() to drivers
nvme-multipath: add support for adaptive I/O policy
nvme-multipath: add sysfs attribute for adaptive I/O policy
nvmf-tcp: add support for retrieving adapter link speed
nvme-multipath: factor fabric link speed into path score
block/blk-stat.h | 4 -
drivers/nvme/host/core.c | 10 +-
drivers/nvme/host/ioctl.c | 7 +-
drivers/nvme/host/multipath.c | 441 +++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 38 ++-
drivers/nvme/host/pr.c | 6 +-
drivers/nvme/host/sysfs.c | 12 +-
drivers/nvme/host/tcp.c | 66 +++++
include/linux/blk-mq.h | 4 +
9 files changed, 562 insertions(+), 26 deletions(-)
--
2.51.0