[RFC PATCHv2 0/4] nvme-multipath: introduce adaptive I/O policy

Nilay Shroff nilay at linux.ibm.com
Thu Oct 9 03:05:22 PDT 2025


Hi,

This series introduces a new adaptive I/O policy for NVMe native
multipath. Existing policies such as numa, round-robin, and queue-depth
are static and do not adapt to real-time transport performance. The numa
policy selects the path closest to the NUMA node of the current CPU,
optimizing memory and path locality, but ignores actual path performance.
The round-robin policy distributes I/O evenly across all paths, providing
fairness but no performance awareness. The queue-depth policy reacts to
instantaneous queue occupancy, avoiding heavily loaded paths, but does not
account for actual latency, throughput, or link speed.

The new adaptive policy addresses these gaps by selecting paths dynamically
based on measured I/O latency, for both PCIe and fabrics transports. Latency
is derived by passively sampling I/O completions. Each path is assigned a
weight proportional to its latency score, and I/Os are then forwarded
accordingly. As conditions change (e.g. latency spikes, bandwidth
differences), path weights are updated, automatically steering traffic
toward better-performing paths.
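
To make the idea concrete, here is a minimal sketch of the mechanism, not
the actual patch code: per-path latency is smoothed with an EWMA, each path
gets a weight inversely proportional to that latency, and I/O is forwarded
by consuming per-path credits derived from the weights. All names below
(struct adaptive_path, adaptive_update_latency(), adaptive_update_weights(),
adaptive_select_path(), the EWMA shift and the credit scheme) are
hypothetical.

#include <stdint.h>
#include <stddef.h>

#define ADAPTIVE_EWMA_SHIFT	3		/* new sample weighs 1/8 */
#define ADAPTIVE_WEIGHT_SCALE	1000		/* weights sum to ~1000 */

struct adaptive_path {
	uint64_t ewma_lat_ns;	/* smoothed completion latency */
	uint32_t weight;	/* share of I/O this path should receive */
	int32_t  credit;	/* remaining credit in the current round */
};

/* Fold one passively sampled completion latency into the EWMA. */
static void adaptive_update_latency(struct adaptive_path *p, uint64_t lat_ns)
{
	if (!p->ewma_lat_ns)
		p->ewma_lat_ns = lat_ns;
	else
		p->ewma_lat_ns = p->ewma_lat_ns -
				 (p->ewma_lat_ns >> ADAPTIVE_EWMA_SHIFT) +
				 (lat_ns >> ADAPTIVE_EWMA_SHIFT);
}

/* Lower latency => larger score; guard against the no-sample case. */
static uint64_t adaptive_score(const struct adaptive_path *p)
{
	return p->ewma_lat_ns ? 1000000000ULL / p->ewma_lat_ns : 1;
}

/* Recompute weights so each path's share is proportional to its score. */
static void adaptive_update_weights(struct adaptive_path *paths, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += adaptive_score(&paths[i]);

	for (i = 0; i < n; i++) {
		paths[i].weight = (uint32_t)(ADAPTIVE_WEIGHT_SCALE *
					     adaptive_score(&paths[i]) / sum);
		if (!paths[i].weight)	/* keep every path selectable */
			paths[i].weight = 1;
		paths[i].credit = paths[i].weight;
	}
}

/* Pick the path with the most remaining credit; refill when exhausted. */
static struct adaptive_path *adaptive_select_path(struct adaptive_path *paths,
						  size_t n)
{
	struct adaptive_path *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (paths[i].credit > 0 &&
		    (!best || paths[i].credit > best->credit))
			best = &paths[i];

	if (!best) {	/* every path used its share: start a new round */
		for (i = 0; i < n; i++)
			paths[i].credit = paths[i].weight;
		best = &paths[0];
		for (i = 1; i < n; i++)
			if (paths[i].credit > best->credit)
				best = &paths[i];
	}

	best->credit--;
	return best;
}

In the series itself the samples come from passively observed completions
(via the blk_stat hooks from patch 1) and, per the v2 changes below, the
weight recomputation runs from a workqueue rather than the completion path.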

Early results show reduced tail latency under mixed workloads and
improved throughput by exploiting higher-speed links more effectively.
For example, with NVMf/TCP using two paths (one throttled with ~30 ms
delay), fio results with random read/write/rw workloads (direct I/O)
showed:

        numa         round-robin   queue-depth  adaptive
        -----------  -----------   -----------  ---------
READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
        W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s

This patchset includes a total of 4 patches:
[PATCH 1/4] block: expose blk_stat_{enable,disable}_accounting()
  - Make blk_stat APIs available to block drivers.
  - Needed for per-path latency measurement in adaptive policy.

[PATCH 2/4] nvme-multipath: add adaptive I/O policy
  - Implement path scoring based on latency (EWMA).
  - Distribute I/O proportionally to per-path weights.

[PATCH 3/4] nvme: add generic debugfs support
  - Introduce generic debugfs support for the NVMe module.

[PATCH 4/4] nvme-multipath: add debugfs attribute for adaptive I/O policy stats
  - Add "adaptive_stat" under the per-path and head debugfs directories to
    expose adaptive policy state and statistics (a rough sketch of how this
    ties in with patches 1 and 3 follows the patch list).
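
For reference, here is a rough, hypothetical sketch of how patches 1, 3 and 4
might fit together; struct nvme_adaptive_path_stat, its fields and the
register/unregister helpers are illustrative only, and the assumption that
blk_stat_{enable,disable}_accounting() become visible via <linux/blk-mq.h>
is taken from the diffstat, not from the patch code itself.

#include <linux/blk-mq.h>	/* blk_stat_{enable,disable}_accounting() */
#include <linux/debugfs.h>
#include <linux/seq_file.h>

/* Hypothetical per-path bookkeeping; field layout is illustrative only. */
struct nvme_adaptive_path_stat {
	struct request_queue *queue;	/* the path's request queue */
	u64 ewma_lat_ns;		/* smoothed completion latency */
	u32 weight;			/* current share of I/O */
	struct dentry *dentry;		/* debugfs "adaptive_stat" file */
};

/* Read side of the "adaptive_stat" debugfs attribute. */
static int adaptive_stat_show(struct seq_file *m, void *unused)
{
	struct nvme_adaptive_path_stat *st = m->private;

	seq_printf(m, "ewma_latency_ns: %llu\n", st->ewma_lat_ns);
	seq_printf(m, "weight: %u\n", st->weight);
	return 0;
}
DEFINE_SHOW_ATTRIBUTE(adaptive_stat);

/* Called when a path starts using the adaptive policy. */
static void adaptive_stat_register(struct nvme_adaptive_path_stat *st,
				   struct dentry *path_dir)
{
	/* Patch 1: let the driver enable per-queue completion accounting. */
	blk_stat_enable_accounting(st->queue);

	/* Patches 3/4: expose the state under the per-path debugfs dir. */
	st->dentry = debugfs_create_file("adaptive_stat", 0444, path_dir,
					 st, &adaptive_stat_fops);
}

static void adaptive_stat_unregister(struct nvme_adaptive_path_stat *st)
{
	debugfs_remove(st->dentry);
	blk_stat_disable_accounting(st->queue);
}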

As usual, feedback and suggestions are most welcome!

Thanks!

Changes from v1:
  - Ensure that I/O completion is processed on the same CPU that
    submitted the I/O (Hannes Reinecke)
  - Remove adapter link speed from the path weight calculation
    (Hannes Reinecke)
  - Expose adaptive I/O stats under debugfs instead of sysfs
    (Hannes Reinecke)
  - Move the path weight calculation from the I/O completion path to a
    workqueue (roughly sketched below)
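
A hypothetical sketch of that last point, purely to illustrate the split
(none of these names come from the patches): the completion path only flags
that new samples have arrived, and a work item does the weight math.

#include <linux/workqueue.h>
#include <linux/atomic.h>

/* Hypothetical per-head context; field names are illustrative only. */
struct adaptive_head {
	struct work_struct weight_work;	/* runs outside the completion path */
	atomic_t weight_pending;	/* avoid queueing the work twice */
};

/* Heavy part: recompute per-path weights from the EWMA latencies. */
static void adaptive_weight_workfn(struct work_struct *work)
{
	struct adaptive_head *head =
		container_of(work, struct adaptive_head, weight_work);

	/* ... walk the paths and refresh their weights here ... */

	atomic_set(&head->weight_pending, 0);
}

/* Completion path: record the sample elsewhere, defer the weight math. */
static void adaptive_on_completion(struct adaptive_head *head)
{
	if (!atomic_xchg(&head->weight_pending, 1))
		schedule_work(&head->weight_work);
}

static void adaptive_head_init(struct adaptive_head *head)
{
	INIT_WORK(&head->weight_work, adaptive_weight_workfn);
	atomic_set(&head->weight_pending, 0);
}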

Nilay Shroff (4):
  block: expose blk_stat_{enable,disable}_accounting() to drivers
  nvme-multipath: add support for adaptive I/O policy
  nvme: add generic debugfs support
  nvme-multipath: add debugfs attribute for adaptive I/O policy stats

 block/blk-stat.h              |   4 -
 drivers/nvme/host/Makefile    |   2 +-
 drivers/nvme/host/core.c      |  13 +-
 drivers/nvme/host/debugfs.c   | 239 ++++++++++++++++++++
 drivers/nvme/host/ioctl.c     |   7 +-
 drivers/nvme/host/multipath.c | 400 ++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h      |  55 ++++-
 drivers/nvme/host/pr.c        |   6 +-
 drivers/nvme/host/sysfs.c     |   2 +-
 include/linux/blk-mq.h        |   4 +
 10 files changed, 705 insertions(+), 27 deletions(-)
 create mode 100644 drivers/nvme/host/debugfs.c

-- 
2.51.0



