[RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy

Nilay Shroff nilay at linux.ibm.com
Tue Dec 9 05:56:40 PST 2025


Hi Keith,

Just a gentle ping on this one...

It has been reviewed and has been ready for some time now, so I wanted to
check whether you have any remaining feedback or concerns, or whether you
could consider pulling it into nvme-next.

Link to the latest version for convenience:
https://lore.kernel.org/all/20251105103347.86059-1-nilay@linux.ibm.com/

Please let me know if there's anything further needed on my side.

Thanks,
--Nilay

On 11/5/25 4:03 PM, Nilay Shroff wrote:
> Hi,
> 
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance. The numa
> policy selects the path closest to the NUMA node of the current CPU,
> optimizing memory and path locality, but ignores actual path performance.
> The round-robin policy distributes I/O evenly across all paths, providing
> fairness but not performance awareness. The queue-depth policy reacts to
> instantaneous queue occupancy, avoiding heavily loaded paths, but does
> not account for actual latency, throughput, or link speed.
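> 
> For context, the policy in effect is selected per subsystem via sysfs.
> Assuming this series plugs into the existing iopolicy attribute (the
> subsystem name below is just an example), switching policies would look
> like:
> 
>   # echo adaptive > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy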
> 
> The new adaptive policy addresses these gaps by selecting paths
> dynamically based on measured I/O latency for both PCIe and fabrics.
> Latency is derived by passively sampling I/O completions. Each path is
> assigned a weight proportional to its latency score, and I/Os are then
> forwarded accordingly. As conditions change (e.g. latency spikes,
> bandwidth differences), path weights are updated, automatically steering
> traffic toward better-performing paths.
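> 
> To make the mechanism concrete, here is a rough userspace sketch of the
> idea. This is not the patch code: path_stat, ewma_update() and
> recalc_weights() are invented names, though the series does normalize
> weights to the range 0-128 and exposes the EWMA shift as a debugfs
> tunable (patch 4 below):
> 
>   #include <stdio.h>
>   #include <stdint.h>
> 
>   #define EWMA_SHIFT 3    /* smoothing factor: 1/2^3 = 12.5% per sample */
>   #define WEIGHT_MAX 128  /* weights normalized to 0..128 */
> 
>   struct path_stat {
>           uint64_t ewma_lat_ns;   /* smoothed completion latency */
>           uint32_t weight;        /* share of I/O this path receives */
>   };
> 
>   /* Fold one passively sampled completion latency into the average. */
>   static void ewma_update(struct path_stat *p, uint64_t sample_ns)
>   {
>           if (!p->ewma_lat_ns) {
>                   p->ewma_lat_ns = sample_ns;
>                   return;
>           }
>           /* new = old - old/2^shift + sample/2^shift */
>           p->ewma_lat_ns += (sample_ns >> EWMA_SHIFT) -
>                             (p->ewma_lat_ns >> EWMA_SHIFT);
>   }
> 
>   /*
>    * Lower smoothed latency => larger share of WEIGHT_MAX. This toy
>    * normalization is fine for the two-path example below.
>    */
>   static void recalc_weights(struct path_stat *paths, int npaths)
>   {
>           uint64_t sum = 0;
>           int i;
> 
>           for (i = 0; i < npaths; i++)
>                   sum += paths[i].ewma_lat_ns;
>           for (i = 0; i < npaths; i++)
>                   paths[i].weight = sum ?
>                           (sum - paths[i].ewma_lat_ns) * WEIGHT_MAX / sum :
>                           WEIGHT_MAX;
>   }
> 
>   int main(void)
>   {
>           /* a ~1 ms path vs. a ~30 ms throttled path, as in the test below */
>           struct path_stat p[2] = { { 1000000, 0 }, { 30000000, 0 } };
> 
>           ewma_update(&p[0], 900000);
>           ewma_update(&p[1], 31000000);
>           recalc_weights(p, 2);
>           printf("path0 weight=%u path1 weight=%u\n",
>                  p[0].weight, p[1].weight);
>           return 0;
>   }
> 
> With these inputs the fast path ends up with weight 123 and the
> throttled one with weight 4, which illustrates why the slow link only
> receives a small fraction of the traffic.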
> 
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
> 
>         numa         round-robin   queue-depth  adaptive
>         -----------  -----------   -----------  ---------
> READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
> WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
> RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
>         W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s
> 
> This patchset includes a total of 7 patches:
> [PATCH 1/7] block: expose blk_stat_{enable,disable}_accounting()
>   - Make blk_stat APIs available to block drivers.
>   - Needed for per-path latency measurement in adaptive policy.
> 
> [PATCH 2/7] nvme-multipath: add adaptive I/O policy
>   - Implement path scoring based on latency (EWMA).
>   - Distribute I/O proportionally to per-path weights (see the sketch
>     after this list).
> 
> [PATCH 3/7] nvme: add generic debugfs support
>   - Introduce generic debugfs support for the NVMe module.
> 
> [PATCH 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
>   - Add a debugfs attribute to control the EWMA shift.
> 
> [PATCH 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
>   - Add a debugfs attribute to control the path weight calculation
>     timeout.
> 
> [PATCH 6/7] nvme-multipath: add debugfs attribute adaptive_stat
>   - Add "adaptive_stat" under the per-path and head debugfs directories
>     to expose adaptive policy state and statistics.
> 
> [PATCH 7/7] nvme-multipath: add documentation for adaptive I/O policy
>   - Add documentation for the adaptive I/O multipath policy.
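> 
> And the sketch referenced from patch 2: a hypothetical illustration of
> the proportional forwarding step. Field and function names here are
> invented; per the v2 changelog below, the real code does the scan in a
> single loop over the paths:
> 
>   #include <stdint.h>
>   #include <stddef.h>
> 
>   struct adaptive_path {
>           uint32_t weight;  /* 0..128, from the latency calculation */
>           int32_t  credit;  /* remaining share in the current round */
>   };
> 
>   /*
>    * Pick the path with the most credit left and charge it one I/O.
>    * Once every path is exhausted, re-arm all credits from the weights,
>    * so over a full round each path services roughly
>    * weight / sum(weights) of the submissions.
>    */
>   static struct adaptive_path *adaptive_select(struct adaptive_path *p,
>                                                int npaths)
>   {
>           struct adaptive_path *best = NULL;
>           int32_t total = 0;
>           int i;
> 
>           if (npaths <= 0)
>                   return NULL;
> 
>           for (i = 0; i < npaths; i++) {
>                   total += p[i].credit;
>                   if (!best || p[i].credit > best->credit)
>                           best = &p[i];
>           }
> 
>           if (total <= 0) {       /* round over: refill and rescan */
>                   for (i = 0; i < npaths; i++)
>                           p[i].credit = p[i].weight;
>                   best = &p[0];
>                   for (i = 1; i < npaths; i++)
>                           if (p[i].credit > best->credit)
>                                   best = &p[i];
>           }
> 
>           best->credit--;
>           return best;
>   }
> 
> With the 123/4 split from the earlier sketch, roughly 97% of
> submissions would land on the fast path until its latency rises and
> the weights are recalculated.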
> 
> As usual, feedback and suggestions are most welcome!
> 
> Thanks!
> 
> Changes from v4:
>   - Added patch #7, which includes documentation for the adaptive I/O
>     policy. (Guixin Liu)
> Link to v4: https://lore.kernel.org/all/20251104104533.138481-1-nilay@linux.ibm.com/    
> 
> Changes from v3:
>   - Renamed the adaptive APIs (which actually enable/disable the
>     adaptive policy) to reflect the work they do. Also removed the
>     misleading use of "current_path" from the adaptive policy code
>     (Hannes Reinecke)
>   - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
>     sysfs to debugfs (Hannes Reinecke)
> Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
> 
> Changes from v2:
>   - Added a new patch to allow the user to configure the EWMA shift
>     through sysfs (Hannes Reinecke)
>   - Added a new patch to allow the user to configure the path weight
>     calculation timeout (Hannes Reinecke)
>   - Distinguish between read/write and other commands (e.g. admin
>     commands) and calculate a path weight for other commands that is
>     separate from the read/write weight. (Hannes Reinecke)
>   - Normalize per-path weight in the range from 0-128 instead
>     of 0-100 (Hannes Reinecke)
>   - Restructure and optimize adaptive I/O forwarding code to use
>     one loop instead of two (Hannes Reinecke)
> Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
> 
> Changes from v1:
>   - Ensure that I/O completion occurs on the same CPU that submitted
>     the I/O (Hannes Reinecke)
>   - Remove adapter link speed from the path weight calculation
>     (Hannes Reinecke)
>   - Add adaptive I/O stats under debugfs instead of sysfs
>     (Hannes Reinecke)
>   - Move the path weight calculation from the I/O completion path to a
>     workqueue
> Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
> 
> Nilay Shroff (7):
>   block: expose blk_stat_{enable,disable}_accounting() to drivers
>   nvme-multipath: add support for adaptive I/O policy
>   nvme: add generic debugfs support
>   nvme-multipath: add debugfs attribute adaptive_ewma_shift
>   nvme-multipath: add debugfs attribute adaptive_weight_timeout
>   nvme-multipath: add debugfs attribute adaptive_stat
>   nvme-multipath: add documentation for adaptive I/O policy
> 
>  Documentation/admin-guide/nvme-multipath.rst |  19 +
>  block/blk-stat.h                             |   4 -
>  drivers/nvme/host/Makefile                   |   2 +-
>  drivers/nvme/host/core.c                     |  22 +-
>  drivers/nvme/host/debugfs.c                  | 335 +++++++++++++++
>  drivers/nvme/host/ioctl.c                    |  31 +-
>  drivers/nvme/host/multipath.c                | 430 ++++++++++++++++++-
>  drivers/nvme/host/nvme.h                     |  86 +++-
>  drivers/nvme/host/pr.c                       |   6 +-
>  drivers/nvme/host/sysfs.c                    |   2 +-
>  include/linux/blk-mq.h                       |   4 +
>  11 files changed, 913 insertions(+), 28 deletions(-)
>  create mode 100644 drivers/nvme/host/debugfs.c
> 
