[RFC PATCHv4 0/6] nvme-multipath: introduce adaptive I/O policy

Guixin Liu kanie at linux.alibaba.com
Tue Nov 4 08:57:35 PST 2025


Hi Nilay:

Could you please update Documentation/admin-guide/nvme-multipath.rst too?

Best Regards,

Guixin Liu

On 2025/11/4 18:45, Nilay Shroff wrote:
> Hi,
>
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance. The numa
> policy selects the path closest to the NUMA node of the current CPU,
> optimizing memory and path locality, but it ignores actual path performance.
> The round-robin policy distributes I/O evenly across all paths, providing
> fairness but no performance awareness. The queue-depth policy reacts to
> instantaneous queue occupancy, avoiding heavily loaded paths, but does not
> account for actual latency, throughput, or link speed.
>
> The new adaptive policy addresses these gaps by selecting paths dynamically
> based on measured I/O latency for both PCIe and fabrics. Latency is
> derived by passively sampling I/O completions. Each path is assigned a
> weight proportional to its latency score, and I/Os are then forwarded
> accordingly. As conditions change (e.g. latency spikes, bandwidth
> differences), path weights are updated, automatically steering traffic
> toward better-performing paths.
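>
> To illustrate the idea, here is a minimal userspace sketch (not the patch
> code) of the two building blocks described above: an EWMA latency estimate
> updated from sampled completions, and a pick that forwards I/O to a path
> with probability proportional to its weight. All names, the two-path setup,
> and the 0-128 scaling below are illustrative assumptions.
>
> /*
>  * Illustrative sketch only -- not code from this series.
>  * Build: gcc -o adaptive_demo adaptive_demo.c
>  */
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> #define NR_PATHS	2
> #define EWMA_SHIFT	3	/* smoothing factor of 1/2^3 */
>
> struct demo_path {
> 	uint64_t ewma_lat_ns;	/* smoothed completion latency */
> 	uint32_t weight;	/* normalized 0..128, higher is better */
> };
>
> static struct demo_path paths[NR_PATHS];
>
> /* Fold one sampled completion latency into the path's EWMA. */
> static void update_latency(struct demo_path *p, uint64_t sample_ns)
> {
> 	if (!p->ewma_lat_ns)
> 		p->ewma_lat_ns = sample_ns;
> 	else
> 		p->ewma_lat_ns = p->ewma_lat_ns -
> 				 (p->ewma_lat_ns >> EWMA_SHIFT) +
> 				 (sample_ns >> EWMA_SHIFT);
> }
>
> /*
>  * Recompute weights periodically: with two paths, each path's share is
>  * proportional to the *other* path's latency, so the faster path gets
>  * the larger weight.
>  */
> static void update_weights(void)
> {
> 	uint64_t sum = paths[0].ewma_lat_ns + paths[1].ewma_lat_ns;
>
> 	if (!sum)
> 		return;
> 	paths[0].weight = (paths[1].ewma_lat_ns * 128) / sum;
> 	paths[1].weight = (paths[0].ewma_lat_ns * 128) / sum;
> }
>
> /* Pick a path with probability proportional to its weight. */
> static struct demo_path *pick_path(void)
> {
> 	uint32_t total = paths[0].weight + paths[1].weight;
> 	uint32_t r;
>
> 	if (!total)
> 		return &paths[0];
> 	r = rand() % total;
> 	return r < paths[0].weight ? &paths[0] : &paths[1];
> }
>
> int main(void)
> {
> 	int i, hits[NR_PATHS] = { 0 };
>
> 	/* Pretend path 0 completes in ~200us and path 1 in ~600us. */
> 	for (i = 0; i < 64; i++) {
> 		update_latency(&paths[0], 200 * 1000ULL);
> 		update_latency(&paths[1], 600 * 1000ULL);
> 	}
> 	update_weights();
>
> 	for (i = 0; i < 1000; i++)
> 		hits[pick_path() == &paths[0] ? 0 : 1]++;
>
> 	printf("weights %u/%u, picks %d/%d\n",
> 	       paths[0].weight, paths[1].weight, hits[0], hits[1]);
> 	return 0;
> }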
>
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
>
>          numa          round-robin   queue-depth   adaptive
>          ------------  ------------  ------------  ------------
> READ:    50.0 MiB/s    105 MiB/s     230 MiB/s     350 MiB/s
> WRITE:   65.9 MiB/s    125 MiB/s     385 MiB/s     446 MiB/s
> RW:      R:30.6 MiB/s  R:56.5 MiB/s  R:122 MiB/s   R:175 MiB/s
>          W:30.7 MiB/s  W:56.5 MiB/s  W:122 MiB/s   W:175 MiB/s
>
> This patchset includes a total of 6 patches:
> [PATCH 1/6] block: expose blk_stat_{enable,disable}_accounting()
>    - Make blk_stat APIs available to block drivers.
>    - Needed for per-path latency measurement in adaptive policy.
>
> [PATCH 2/6] nvme-multipath: add adaptive I/O policy
>    - Implement path scoring based on latency (EWMA).
>    - Distribute I/O proportionally to per-path weights.
>
> [PATCH 3/6] nvme: add generic debugfs support
>    - Introduce generic debugfs support for the NVMe module
>
> [PATCH 4/6] nvme-multipath: add debugfs attribute adaptive_ewma_shift
>    - Adds a debugfs attribute to control the EWMA shift (an
>      illustrative debugfs sketch follows this patch list)
>
> [PATCH 5/6] nvme-multipath: add debugfs attribute adaptive_weight_timeout
>    - Adds a debugfs attribute to control the path weight calculation timeout
>
> [PATCH 6/6] nvme-multipath: add debugfs attribute adaptive_stat
>    - Add "adaptive_stat" under per-path and head debugfs directories to
>      expose adaptive policy state and statistics.
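>
> For readers less familiar with debugfs, below is a rough, self-contained
> sketch of how tunables like adaptive_ewma_shift and adaptive_weight_timeout
> are typically exposed; the directory name, variable names and default
> values here are assumptions for illustration, not the layout used by this
> series.
>
> // SPDX-License-Identifier: GPL-2.0
> /*
>  * Illustrative standalone module: exposes two u32 tunables under
>  * debugfs using the standard helpers. Not code from this series.
>  */
> #include <linux/module.h>
> #include <linux/debugfs.h>
>
> static struct dentry *demo_dir;
> static u32 demo_ewma_shift = 3;		/* EWMA smoothing: 1/2^shift */
> static u32 demo_weight_timeout = 1000;	/* assumed recalc period, in ms */
>
> static int __init demo_init(void)
> {
> 	demo_dir = debugfs_create_dir("nvme_adaptive_demo", NULL);
> 	debugfs_create_u32("adaptive_ewma_shift", 0644, demo_dir,
> 			   &demo_ewma_shift);
> 	debugfs_create_u32("adaptive_weight_timeout", 0644, demo_dir,
> 			   &demo_weight_timeout);
> 	return 0;
> }
>
> static void __exit demo_exit(void)
> {
> 	debugfs_remove_recursive(demo_dir);
> }
>
> module_init(demo_init);
> module_exit(demo_exit);
> MODULE_DESCRIPTION("debugfs tunables sketch");
> MODULE_LICENSE("GPL");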
>
> As usual, feedback and suggestions are most welcome!
>
> Thanks!
>
> Changes from v3:
>    - Rename the adaptive APIs (which actually enable/disable the
>      adaptive policy) to reflect the work they do. Also removed the
>      misleading use of "current_path" from the adaptive policy code
>      (Hannes Reinecke)
>    - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
>      sysfs to debugfs (Hannes Reinecke)
> Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
>
> Changes from v2:
>    - Added a new patch to allow the user to configure the EWMA shift
>      through sysfs (Hannes Reinecke)
>    - Added a new patch to allow the user to configure the path weight
>      calculation timeout (Hannes Reinecke)
>    - Distinguish between read/write and other commands (e.g. admin
>      commands) and calculate a path weight for other commands that is
>      separate from the read/write weight (Hannes Reinecke)
>    - Normalize the per-path weight to the range 0-128 instead of
>      0-100 (Hannes Reinecke)
>    - Restructure and optimize adaptive I/O forwarding code to use
>      one loop instead of two (Hannes Reinecke)
> Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
>
> Changes from v1:
>    - Ensure that I/O completion occurs on the same CPU that submitted
>      the I/O (Hannes Reinecke)
>    - Remove adapter link speed from the path weight calculation
>      (Hannes Reinecke)
>    - Add adaptive I/O stats under debugfs instead of sysfs
>      (Hannes Reinecke)
>    - Move the path weight calculation from the I/O completion path to
>      a workqueue
> Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
>
> Nilay Shroff (6):
>    block: expose blk_stat_{enable,disable}_accounting() to drivers
>    nvme-multipath: add support for adaptive I/O policy
>    nvme: add generic debugfs support
>    nvme-multipath: add debugfs attribute adaptive_ewma_shift
>    nvme-multipath: add debugfs attribute adaptive_weight_timeout
>    nvme-multipath: add debugfs attribute adaptive_stat
>
>   block/blk-stat.h              |   4 -
>   drivers/nvme/host/Makefile    |   2 +-
>   drivers/nvme/host/core.c      |  22 +-
>   drivers/nvme/host/debugfs.c   | 335 ++++++++++++++++++++++++++
>   drivers/nvme/host/ioctl.c     |  31 ++-
>   drivers/nvme/host/multipath.c | 430 +++++++++++++++++++++++++++++++++-
>   drivers/nvme/host/nvme.h      |  87 ++++++-
>   drivers/nvme/host/pr.c        |   6 +-
>   drivers/nvme/host/sysfs.c     |   2 +-
>   include/linux/blk-mq.h        |   4 +
>   10 files changed, 895 insertions(+), 28 deletions(-)
>   create mode 100644 drivers/nvme/host/debugfs.c
>
