[LSF/MM/BPF TOPIC] Adaptive NVMe Multipath I/O Policy: Latency Measurement and Path Scoring Design
Nilay Shroff
nilay at linux.ibm.com
Fri Feb 20 01:52:39 PST 2026
Hi,
I posted an RFC [1] back in September 2025 proposing a new NVMe multipath I/O
policy called adaptive I/O. The series is currently at v5 [2] and has received
valuable feedback from the community, which has been incorporated into subsequent
revisions. The policy dynamically measures I/O latency for each active NVMe
path and derives a weight/score per path based on observed latency. I/O is
then distributed proportionally across paths: lower-latency paths receive more
I/O while higher-latency paths receive less.
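To make the idea concrete, here is a minimal userspace Python sketch of the
scoring scheme described above: smooth per-path latency samples with an EWMA,
then turn the smoothed latencies into proportional weights, so lower-latency
paths get a larger share of I/O. The shift-based EWMA form, the function names,
and the scaling are illustrative assumptions, not the kernel implementation.

```python
# Illustrative sketch (userspace Python, NOT the kernel patch): derive
# per-path weights from smoothed latency so that I/O can be distributed
# proportionally, lower latency -> larger share.

def ewma_us(prev_us, sample_us, shift=3):
    # Shift-based exponentially weighted moving average, similar in
    # spirit to kernel-style integer EWMAs (assumed form):
    #   new = prev - prev/2^shift + sample/2^shift
    return prev_us - (prev_us >> shift) + (sample_us >> shift)

def path_weights(latencies_us):
    # Weight each path by inverse latency, normalized so the
    # weights sum to 1.0; guard against zero-latency samples.
    inv = [1.0 / max(lat, 1) for lat in latencies_us]
    total = sum(inv)
    return [w / total for w in inv]

# Example: a 100us path vs. a 300us path -> 75% / 25% split.
weights = path_weights([100, 300])
```

A dispatcher could then pick paths with weighted selection over these values;
the point of the sketch is only the latency-to-weight mapping, not the
dispatch mechanics.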
Initial experiments show clear performance benefits, particularly for NVMe-oF
deployments where path characteristics can differ due to fabric bottlenecks
or heterogeneous links. In such environments, static or round-robin policies
may underperform, whereas a latency-aware policy can better utilize available
bandwidth and improve overall throughput and tail latency.
While there is general agreement on the usefulness of a latency-aware policy,
there is not yet consensus on how latency should be measured and how path
scores should be derived. The original proposal measured latency per-CPU, but
concerns were raised about mixed I/O workloads and potential noise from
scheduler effects. In response, additional approaches were explored, including
I/O-size buckets and NUMA-aware measurements.
## Discussion
At this point, four possible approaches are under discussion:
1. Per-CPU latency scoring:
Measure latency per CPU and derive per-CPU path scores. Use the current
CPU’s score to choose the path when dispatching I/O.
2. Per-CPU with I/O-size buckets:
Measure latency per CPU and maintain separate path scores per I/O-size bucket.
Dispatch decisions use the score corresponding to the current CPU and I/O size.
3. Per-NUMA latency scoring:
Measure latency per NUMA node and derive per-NUMA path scores. Dispatch decisions
use the score associated with the NUMA node where the I/O originates.
4. Per-NUMA with I/O-size buckets:
Measure latency per NUMA node and maintain per-NUMA, per-I/O-size path scores.
Dispatch decisions consider both NUMA locality and I/O size.
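As a rough illustration of how the bucketed variants (approaches 2 and 4)
would consult their score tables, here is a small Python sketch of a
per-NUMA-node, per-I/O-size-bucket lookup. The bucket boundaries, table
layout, and function names are hypothetical choices made for this example
only; they are not the proposed kernel data structures.

```python
# Hypothetical sketch of approach 4: scores indexed by NUMA node and
# I/O-size bucket. Bucket limits below are assumed for illustration.

BUCKET_LIMITS = [4096, 65536, 1 << 20]  # <=4K, <=64K, <=1M, larger

def size_bucket(io_bytes):
    # Map an I/O size to its bucket index.
    for i, limit in enumerate(BUCKET_LIMITS):
        if io_bytes <= limit:
            return i
    return len(BUCKET_LIMITS)

def best_path(scores, numa_node, io_bytes):
    # scores[node][bucket] is a list of per-path scores; return the
    # index of the highest-scoring path for this node and I/O size.
    per_path = scores[numa_node][size_bucket(io_bytes)]
    return max(range(len(per_path)), key=lambda p: per_path[p])
```

Approaches 1 and 3 are the degenerate case of a single bucket; approach 2
would index the table by CPU rather than NUMA node. The trade-off the list
above captures is essentially table size and measurement noise versus how
faithfully the scores track the latency a given I/O will actually see.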
We have not yet reached consensus on which of these approaches provides the best
balance between accuracy, stability, and implementation complexity. An in-person
discussion at LSFMM would help bring stakeholders together to evaluate trade-offs,
share data, and converge on a direction so we can move the adaptive I/O policy
toward upstream acceptance.
[1] https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
[2] https://lore.kernel.org/all/20251105103347.86059-1-nilay@linux.ibm.com/
Thanks,
--Nilay