[RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Sat Dec 13 00:22:39 PST 2025
On 12/12/25 5:38 PM, Sagi Grimberg wrote:
>
>
> On 05/11/2025 12:33, Nilay Shroff wrote:
>> Hi,
>>
>> This series introduces a new adaptive I/O policy for NVMe native
>> multipath. Existing policies such as numa, round-robin, and queue-depth
>> are static and do not adapt to real-time transport performance.
>
> It can be argued that queue-depth is a proxy of latency.
>
>> The numa policy
>> selects the path closest to the NUMA node of the current CPU, optimizing
>> memory and path locality, but ignores actual path performance. The
>> round-robin policy distributes I/O evenly across all paths, providing
>> fairness but no performance awareness. The queue-depth policy reacts to
>> instantaneous queue occupancy, avoiding heavily loaded paths, but does not
>> account for actual latency, throughput, or link speed.
>>
>> The new adaptive policy addresses these gaps by selecting paths dynamically
>> based on measured I/O latency for both PCIe and fabrics.
>
> Adaptive is not a good name. Maybe weighted-latency or wplat (weighted path latency)
> or something like that.
>
Yeah, I also talked to Hannes about this and he suggested naming it either
"weighted-latency" or "ewma-latency". Which do you prefer?
>> Latency is
>> derived by passively sampling I/O completions. Each path is assigned a
>> weight proportional to its latency score, and I/Os are then forwarded
>> accordingly. As conditions change (e.g. latency spikes, bandwidth
>> differences), path weights are updated, automatically steering traffic
>> toward better-performing paths.
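To make the weighting step concrete, here is a condensed user-space model of
the idea (the structure, helper names, and normalization to 100 credits are
assumptions for illustration, not the actual patch code):

/*
 * Each path gets a weight inversely proportional to its smoothed latency,
 * and I/Os are spread across paths in proportion to those weights by
 * consuming per-path credits.
 */
#include <stddef.h>

#define WEIGHT_TOTAL	100ULL		/* assumed: weights sum to roughly 100 */
#define INV_SCALE	1000000000ULL	/* 1e9 so 1/latency stays a useful integer */

struct demo_path {
	unsigned long long ewma_lat_ns;	/* smoothed completion latency */
	unsigned long long weight;	/* share of WEIGHT_TOTAL */
	unsigned long long credits;	/* I/Os left for this path in this round */
};

/* Lower smoothed latency => larger share of the per-round credits. */
static void recompute_weights(struct demo_path *paths, size_t npaths)
{
	unsigned long long sum_inv = 0;
	size_t i;

	for (i = 0; i < npaths; i++)
		sum_inv += INV_SCALE / (paths[i].ewma_lat_ns ? paths[i].ewma_lat_ns : 1);

	if (!sum_inv)			/* all paths extremely slow: fall back evenly */
		sum_inv = 1;

	for (i = 0; i < npaths; i++) {
		unsigned long long inv =
			INV_SCALE / (paths[i].ewma_lat_ns ? paths[i].ewma_lat_ns : 1);

		paths[i].weight = WEIGHT_TOTAL * inv / sum_inv;
		/* keep sampling every path so a recovering one can win back load */
		paths[i].credits = paths[i].weight ? paths[i].weight : 1;
	}
}

/* Pick the next path: consume credits, start a new round when exhausted. */
static struct demo_path *select_weighted(struct demo_path *paths, size_t npaths)
{
	size_t i;

	if (!npaths)
		return NULL;

	for (i = 0; i < npaths; i++) {
		if (paths[i].credits) {
			paths[i].credits--;
			return &paths[i];
		}
	}
	recompute_weights(paths, npaths);	/* refills credits for a new round */
	paths[0].credits--;
	return &paths[0];
}

The point is that weights are refreshed as the smoothed latencies drift, so a
path whose latency improves automatically wins back a larger share of the I/O.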
>>
>> Early results show reduced tail latency under mixed workloads and
>> improved throughput by exploiting higher-speed links more effectively.
>> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
>> delay), fio results with random read/write/rw workloads (direct I/O)
>> showed:
>>
>>            numa            round-robin     queue-depth     adaptive
>>            -----------     -----------     -----------     ---------
>> READ:      50.0 MiB/s      105 MiB/s       230 MiB/s       350 MiB/s
>> WRITE:     65.9 MiB/s      125 MiB/s       385 MiB/s       446 MiB/s
>> RW:        R:30.6 MiB/s    R:56.5 MiB/s    R:122 MiB/s     R:175 MiB/s
>>            W:30.7 MiB/s    W:56.5 MiB/s    W:122 MiB/s     W:175 MiB/s
>
> Seems like a nice gain.
> Can you please test for the normal symmetric paths case? Would like
> to see the trade-off...
Yes, I've already tested that. I currently don’t have access to the system,
but based on my earlier runs, the performance for the symmetric-path case
was noticeably better than in the NUMA scenario, and roughly in the same
(or slightly better) range as the round-robin/queue-depth I/O policies. I will
share those numbers once I get access again.
Thanks,
--Nilay