[RFC PATCHv4 0/6] nvme-multipath: introduce adaptive I/O policy
Guixin Liu
kanie at linux.alibaba.com
Tue Nov 4 08:57:35 PST 2025
Hi Nilay:
Could you please update Documentation/admin-guide/nvme-multipath.rst too?
Best Regards,
Guixin Liu
On 2025/11/4 18:45, Nilay Shroff wrote:
> Hi,
>
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance. The numa
> policy selects the path closest to the NUMA node of the current CPU,
> optimizing memory and path locality, but ignores actual path performance.
> The round-robin policy distributes I/O evenly across all paths, providing
> fairness but not performance awareness. The queue-depth policy reacts to
> instantaneous queue occupancy, avoiding heavily loaded paths, but does
> not account for
> actual latency, throughput, or link speed.
>
> The new adaptive policy addresses these gaps by selecting paths dynamically
> based on measured I/O latency for both PCIe and fabrics. Latency is
> derived by passively sampling I/O completions. Each path is assigned a
> weight proportional to its latency score, and I/Os are then forwarded
> accordingly. As conditions change (e.g. latency spikes, bandwidth
> differences), path weights are updated, automatically steering traffic
> toward better-performing paths.
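>
> To make the weighting concrete, here is a minimal sketch of the kind
> of EWMA update such a policy relies on; the function and parameter
> names are illustrative, not the actual patch code:
>
>     #include <linux/types.h>
>
>     /*
>      * Exponentially weighted moving average of completion latency:
>      * new = old - old/2^shift + sample/2^shift. A larger shift gives
>      * more weight to history and smooths out transient spikes.
>      */
>     static inline u64 adaptive_ewma_update(u64 old, u64 sample,
>                                            unsigned int shift)
>     {
>             return old - (old >> shift) + (sample >> shift);
>     }
>
> Paths whose smoothed latency is lower would then receive a larger
> share of the weight budget, steering traffic as described above.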
>
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
>
>              numa          round-robin   queue-depth   adaptive
>              -----------   -----------   -----------   -----------
> READ:        50.0 MiB/s    105 MiB/s     230 MiB/s     350 MiB/s
> WRITE:       65.9 MiB/s    125 MiB/s     385 MiB/s     446 MiB/s
> RW:          R:30.6 MiB/s  R:56.5 MiB/s  R:122 MiB/s   R:175 MiB/s
>              W:30.7 MiB/s  W:56.5 MiB/s  W:122 MiB/s   W:175 MiB/s
>
> This patchset includes a total of 6 patches:
> [PATCH 1/6] block: expose blk_stat_{enable,disable}_accounting()
> - Make blk_stat APIs available to block drivers.
> - Needed for per-path latency measurement in adaptive policy.
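>
> For reference, a sketch of how a driver might use the now-exported
> accounting helpers around its sampling window (the wrapper names are
> made up for illustration):
>
>     #include <linux/blk-mq.h>
>
>     /* Start collecting per-queue request statistics for a path. */
>     static void demo_path_sampling_start(struct request_queue *q)
>     {
>             blk_stat_enable_accounting(q);
>     }
>
>     /* Stop collecting once the path weights have been computed. */
>     static void demo_path_sampling_stop(struct request_queue *q)
>     {
>             blk_stat_disable_accounting(q);
>     }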
>
> [PATCH 2/6] nvme-multipath: add adaptive I/O policy
> - Implement path scoring based on latency (EWMA).
> - Distribute I/O proportionally to per-path weights.
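>
> A minimal sketch of proportional distribution, assuming normalized
> per-path weights and a simple credit counter per path (illustrative
> names, not the patch code):
>
>     #include <linux/types.h>
>
>     struct demo_path {
>             u32 weight;     /* normalized weight, e.g. 0-128 */
>             s32 credit;     /* I/Os left for this path this round */
>     };
>
>     /*
>      * Pick the next path; refill credits from the weights when the
>      * whole round is spent, so each path receives I/Os in proportion
>      * to its weight.
>      */
>     static int demo_select_path(struct demo_path *p, int npaths)
>     {
>             int i, pass;
>
>             for (pass = 0; pass < 2; pass++) {
>                     for (i = 0; i < npaths; i++) {
>                             if (p[i].credit > 0) {
>                                     p[i].credit--;
>                                     return i;
>                             }
>                     }
>                     for (i = 0; i < npaths; i++)
>                             p[i].credit = p[i].weight;
>             }
>             return 0;       /* all weights zero: fall back to path 0 */
>     }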
>
> [PATCH 3/6] nvme: add generic debugfs support
> - Introduce generic debugfs support for NVMe module
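>
> The debugfs plumbing presumably follows the usual kernel pattern; a
> hypothetical sketch (directory and field names are assumptions, and
> the same pattern covers the tunables in the next two patches):
>
>     #include <linux/debugfs.h>
>
>     /* Create a debugfs directory with a tunable u32 attribute. */
>     static struct dentry *demo_debugfs_register(const char *name,
>                                                 u32 *ewma_shift)
>     {
>             struct dentry *dir = debugfs_create_dir(name, NULL);
>
>             debugfs_create_u32("adaptive_ewma_shift", 0644, dir,
>                                ewma_shift);
>             return dir;
>     }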
>
> [PATCH 4/6] nvme-multipath: add debugfs attribute adaptive_ewma_shift
> - Adds a debugfs attribute to control the EWMA shift
>
> [PATCH 5/6] nvme-multipath: add debugfs attribute adaptive_weight_timeout
> - Adds a debugfs attribute to control the path weight calculation timeout
>
> [PATCH 6/6] nvme-multipath: add debugfs attribute adaptive_stat
> - Add "adaptive_stat" under per-path and head debugfs directories to
> expose adaptive policy state and statistics.
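>
> For a read-only stats file like this, a seq_file-backed debugfs entry
> is the usual idiom; a hypothetical sketch (the printed fields are
> assumptions, not the actual patch output):
>
>     #include <linux/debugfs.h>
>     #include <linux/seq_file.h>
>
>     static int demo_adaptive_stat_show(struct seq_file *s, void *unused)
>     {
>             /* Real code would print per-path weights/latencies here. */
>             seq_puts(s, "weight: 64\nlatency_ewma_ns: 123456\n");
>             return 0;
>     }
>     DEFINE_SHOW_ATTRIBUTE(demo_adaptive_stat);
>
>     static void demo_add_stat_file(struct dentry *dir)
>     {
>             debugfs_create_file("adaptive_stat", 0444, dir, NULL,
>                                 &demo_adaptive_stat_fops);
>     }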
>
> As usual, feedback and suggestions are most welcome!
>
> Thanks!
>
> Changes from v3:
> - Rename the adaptive APIs (which actually enable/disable the
> adaptive policy) to reflect the work they do. Also remove the
> misleading use of "current_path" from the adaptive policy code
> (Hannes Reinecke)
> - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
> sysfs to debugfs (Hannes Reinecke)
> Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
>
> Changes from v2:
> - Added a new patch to allow the user to configure the EWMA shift
> through sysfs (Hannes Reinecke)
> - Added a new patch to allow the user to configure the path weight
> calculation timeout (Hannes Reinecke)
> - Distinguish between read/write and other commands (e.g.
> admin commands) and calculate a separate path weight for other
> commands, apart from the read/write weight (Hannes Reinecke)
> - Normalize per-path weights to the range 0-128 instead of
> 0-100 (Hannes Reinecke)
> - Restructure and optimize adaptive I/O forwarding code to use
> one loop instead of two (Hannes Reinecke)
> Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
>
> Changes from v1:
> - Ensure that I/O completion occurs on the same CPU that
> submitted the I/O (Hannes Reinecke)
> - Remove adapter link speed from the path weight calculation
> (Hannes Reinecke)
> - Add adaptive I/O stat under debugfs instead of current sysfs
> (Hannes Reinecke)
> - Move path weight calculation from the I/O completion code
> path to a workqueue
> Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
>
> Nilay Shroff (6):
> block: expose blk_stat_{enable,disable}_accounting() to drivers
> nvme-multipath: add support for adaptive I/O policy
> nvme: add generic debugfs support
> nvme-multipath: add debugfs attribute adaptive_ewma_shift
> nvme-multipath: add debugfs attribute adaptive_weight_timeout
> nvme-multipath: add debugfs attribute adaptive_stat
>
> block/blk-stat.h | 4 -
> drivers/nvme/host/Makefile | 2 +-
> drivers/nvme/host/core.c | 22 +-
> drivers/nvme/host/debugfs.c | 335 ++++++++++++++++++++++++++
> drivers/nvme/host/ioctl.c | 31 ++-
> drivers/nvme/host/multipath.c | 430 +++++++++++++++++++++++++++++++++-
> drivers/nvme/host/nvme.h | 87 ++++++-
> drivers/nvme/host/pr.c | 6 +-
> drivers/nvme/host/sysfs.c | 2 +-
> include/linux/blk-mq.h | 4 +
> 10 files changed, 895 insertions(+), 28 deletions(-)
> create mode 100644 drivers/nvme/host/debugfs.c
>