[RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy

Nilay Shroff nilay at linux.ibm.com
Sat Dec 13 00:22:39 PST 2025



On 12/12/25 5:38 PM, Sagi Grimberg wrote:
> 
> 
> On 05/11/2025 12:33, Nilay Shroff wrote:
>> Hi,
>>
>> This series introduces a new adaptive I/O policy for NVMe native
>> multipath. Existing policies such as numa, round-robin, and queue-depth
>> are static and do not adapt to real-time transport performance.
> 
> It can be argued that queue-depth is a proxy of latency.
> 
>>   The numa policy
>> selects the path closest to the NUMA node of the current CPU, optimizing
>> memory and path locality, but ignores actual path performance. The
>> round-robin policy distributes I/O evenly across all paths, providing
>> fairness but not performance awareness. The queue-depth policy reacts to
>> instantaneous queue occupancy, avoiding heavily loaded paths, but does not
>> account for actual latency, throughput, or link speed.
>>
>> The new adaptive policy addresses these gaps by selecting paths dynamically
>> based on measured I/O latency for both PCIe and fabrics.
> 
> Adaptive is not a good name. Maybe weighted-latency of wplat (weighted path latency)
> or something like that.
> 
Yeah, I also talked to Hannes about this and he suggested naming it either
"weighted-latency" or "ewma-latency". Which do you prefer?

>>   Latency is
>> derived by passively sampling I/O completions. Each path is assigned a
>> weight proportional to its latency score, and I/Os are then forwarded
>> accordingly. As conditions change (e.g. latency spikes, bandwidth
>> differences), path weights are updated, automatically steering traffic
>> toward better-performing paths.
>>
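
To make the above concrete (and to anchor the "weighted-latency" vs.
"ewma-latency" naming question), the scoring roughly boils down to an EWMA
of the per-I/O completion latency that is converted into a relative weight.
Below is a simplified userspace model, not the actual patch code; the
smoothing factor, the 1/latency weighting and the scaling constant are only
illustrative placeholders:

/*
 * Simplified userspace model (NOT the patch code): per-path completion
 * latency is smoothed with an EWMA and converted into a relative weight, so
 * that lower-latency paths get a proportionally larger share of the I/O.
 * The constants below are illustrative placeholders only.
 */
#include <stdio.h>

#define EWMA_OLD	7	/* keep 7/8 of the previous value ... */
#define EWMA_NEW	1	/* ... and fold in 1/8 of the new sample */
#define WEIGHT_SCALE	1000	/* weights sum to (roughly) this total */

struct path_score {
	unsigned long long ewma_lat_ns;	/* smoothed completion latency */
	unsigned int weight;		/* relative share of forwarded I/O */
};

/* Fed from each I/O completion with the measured latency of that I/O. */
static void path_sample(struct path_score *p, unsigned long long lat_ns)
{
	if (!p->ewma_lat_ns)
		p->ewma_lat_ns = lat_ns;
	else
		p->ewma_lat_ns = (p->ewma_lat_ns * EWMA_OLD +
				  lat_ns * EWMA_NEW) / (EWMA_OLD + EWMA_NEW);
}

/* Turn smoothed latencies into weights: weight ~ 1 / ewma_latency. */
static void recompute_weights(struct path_score *p, int npaths)
{
	unsigned long long inv[8], sum = 0;
	int i;

	for (i = 0; i < npaths; i++) {
		inv[i] = 1000000000ULL / (p[i].ewma_lat_ns ? p[i].ewma_lat_ns : 1);
		sum += inv[i];
	}
	for (i = 0; i < npaths; i++)
		p[i].weight = sum ? WEIGHT_SCALE * inv[i] / sum : 0;
}

int main(void)
{
	struct path_score paths[2] = {};

	/* Say the good path completes in ~100us, the throttled one in ~30ms. */
	path_sample(&paths[0], 100ULL * 1000);
	path_sample(&paths[1], 30ULL * 1000 * 1000);
	recompute_weights(paths, 2);

	printf("path0 weight %u/%d, path1 weight %u/%d\n",
	       paths[0].weight, WEIGHT_SCALE,
	       paths[1].weight, WEIGHT_SCALE);	/* ~996 vs ~3 */
	return 0;
}

With, say, a ~100us path against a ~30ms throttled path, the throttled path
ends up with well under 1% of the total weight, so nearly all I/O is steered
to the faster path.
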
>> Early results show reduced tail latency under mixed workloads and
>> improved throughput by exploiting higher-speed links more effectively.
>> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
>> delay), fio results with random read/write/rw workloads (direct I/O)
>> showed:
>>
>>          numa          round-robin   queue-depth   adaptive
>>          ------------  ------------  ------------  ------------
>> READ:    50.0 MiB/s    105 MiB/s     230 MiB/s     350 MiB/s
>> WRITE:   65.9 MiB/s    125 MiB/s     385 MiB/s     446 MiB/s
>> RW:      R:30.6 MiB/s  R:56.5 MiB/s  R:122 MiB/s   R:175 MiB/s
>>          W:30.7 MiB/s  W:56.5 MiB/s  W:122 MiB/s   W:175 MiB/s
> 
> Seems like a nice gain.
> Can you please test for the normal symmetric paths case? Would like
> to see the trade-off...
Yes, I've already tested that. I currently don't have access to the system,
but based on my earlier runs, the performance for the symmetric-path case
was noticeably better than with the numa policy, and roughly in the same
range as (or slightly better than) the round-robin and queue-depth policies.
I will share those numbers once I regain access to the setup.
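
For the symmetric-path case the expectation is that the smoothed latencies
of the paths converge to roughly the same value, the weights even out, and
the selector behaves much like round-robin, leaving the per-completion
sampling as the main extra cost. A toy credit-based selector (again only a
sketch with made-up names, not the patch code) illustrates that degenerate
behaviour:

/*
 * Toy weighted selector (illustrative only, not the patch code): each path
 * is picked in proportion to its weight by handing out per-path "credits".
 * With equal weights this degenerates into plain round-robin, which is why
 * symmetric paths should behave roughly like the round-robin policy.
 */
#include <stdio.h>

struct sel_path {
	const char *name;
	int weight;	/* from the latency scoring */
	int credit;	/* remaining picks in the current cycle */
};

static struct sel_path *select_path(struct sel_path *paths, int npaths)
{
	struct sel_path *best;
	int i, total = 0;

	/* Once a full cycle of picks is consumed, refill from the weights. */
	for (i = 0; i < npaths; i++)
		total += paths[i].credit;
	if (!total)
		for (i = 0; i < npaths; i++)
			paths[i].credit = paths[i].weight;

	/* Pick the path with the most remaining credit. */
	best = &paths[0];
	for (i = 1; i < npaths; i++)
		if (paths[i].credit > best->credit)
			best = &paths[i];
	best->credit--;
	return best;
}

int main(void)
{
	/* Symmetric paths end up with (nearly) equal weights. */
	struct sel_path paths[2] = {
		{ .name = "path0", .weight = 50 },
		{ .name = "path1", .weight = 50 },
	};
	int i;

	for (i = 0; i < 8; i++)
		printf("%s ", select_path(paths, 2)->name);
	printf("\n");	/* prints: path0 path1 path0 path1 ... */
	return 0;
}

With unequal weights the same loop simply hands the faster path a
proportionally larger number of picks per cycle.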

Thanks,
--Nilay  



