[RFC PATCHv3 2/6] nvme-multipath: add support for adaptive I/O policy

Nilay Shroff nilay at linux.ibm.com
Wed Oct 29 07:21:24 PDT 2025


Hi Christoph,

On 10/29/25 3:10 PM, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 02:59:36PM +0530, Nilay Shroff wrote:
>> This commit introduces a new I/O policy named "adaptive". Users can
>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>> subsystemX/iopolicy"
>>
>> The adaptive policy dynamically distributes I/O based on measured
>> completion latency. The main idea is to calculate latency for each path,
>> derive a weight, and then proportionally forward I/O according to those
>> weights.
> 
> This really sounds like a lot of overhead, and really smells of all
> that horrible old-school FC SAN thinking we've been carefully trying
> to avoid in nvme.
> 
> What's the point here?

Thanks for the feedback!

The primary goal of the adaptive I/O policy is to improve I/O distribution
across NVMe paths when backend latency varies significantly, for example with
asymmetric paths in multipath environments (different fabrics, load imbalance,
or transient congestion). It is therefore aimed mainly at NVMe-oF deployments.

The mechanism itself is lightweight. The policy samples I/O completion latency
periodically (with configurable sample window and EWMA smoothing) and derives
proportional weights per path. The intent is to avoid excessive control overhead
while providing better path utilization than simple round-robin or queue-depth
based policies in heterogeneous environments.
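
To make that concrete, here is a minimal user-space sketch of the idea, not
the patch code itself: per-path completion latency is smoothed with an EWMA,
each path gets a weight proportional to the inverse of its smoothed latency,
and I/O is forwarded in proportion to those weights. All names and constants
below (path_stat, EWMA_DIV, WEIGHT_SCALE, the sample latencies) are
illustrative assumptions, not values taken from the patch.

/*
 * Illustrative sketch of the adaptive policy's weighting scheme.
 * Not the patch code; names, constants and latencies are made up.
 */
#include <stdio.h>
#include <stdint.h>

#define EWMA_DIV      8     /* new sample contributes 1/8, history 7/8 */
#define WEIGHT_SCALE  128   /* per-path weights sum to roughly this value */

struct path_stat {
	uint64_t ewma_lat_ns;   /* smoothed completion latency */
	uint32_t weight;        /* share of I/O this path should receive */
	uint32_t credits;       /* remaining I/Os before moving on */
};

/* Fold a new completion-latency sample into the path's EWMA. */
static void update_latency(struct path_stat *p, uint64_t sample_ns)
{
	if (!p->ewma_lat_ns)
		p->ewma_lat_ns = sample_ns;
	else
		p->ewma_lat_ns = (p->ewma_lat_ns * (EWMA_DIV - 1) +
				  sample_ns) / EWMA_DIV;
}

/* Weight each path in proportion to the inverse of its smoothed latency. */
static void recompute_weights(struct path_stat *paths, int n)
{
	uint64_t inv[16], inv_sum = 0;   /* up to 16 paths in this sketch */
	int i;

	for (i = 0; i < n; i++) {
		inv[i] = paths[i].ewma_lat_ns ?
			 UINT64_C(1000000000) / paths[i].ewma_lat_ns : 0;
		inv_sum += inv[i];
	}
	for (i = 0; i < n; i++) {
		paths[i].weight = inv_sum ?
			inv[i] * WEIGHT_SCALE / inv_sum : 1;
		if (!paths[i].weight)
			paths[i].weight = 1;    /* keep slow paths probed */
		paths[i].credits = paths[i].weight;
	}
}

/* Pick the next path: consume credits so I/O follows the weights. */
static int select_path(struct path_stat *paths, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (paths[i].credits) {
			paths[i].credits--;
			return i;
		}
	}
	/* all credits spent: refill from the weights and start over */
	for (i = 0; i < n; i++)
		paths[i].credits = paths[i].weight;
	paths[0].credits--;
	return 0;
}

int main(void)
{
	/* two paths: a hypothetical ~1 ms path and a ~30 ms throttled path */
	struct path_stat paths[2] = { { 0 }, { 0 } };
	int sent[2] = { 0, 0 };

	for (int i = 0; i < 64; i++) {
		update_latency(&paths[0], 1000000);     /* 1 ms */
		update_latency(&paths[1], 30000000);    /* 30 ms */
	}
	recompute_weights(paths, 2);

	for (int i = 0; i < 1000; i++)
		sent[select_path(paths, 2)]++;

	printf("weights: %u vs %u, I/Os issued: %d vs %d\n",
	       paths[0].weight, paths[1].weight, sent[0], sent[1]);
	return 0;
}

The actual patch runs this per-path bookkeeping from the completion path
under a sampling window, so the per-I/O cost is essentially the EWMA update
shown above.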

That said, I’m happy to share some profiling data and latency distributions
from test runs to demonstrate that the overhead remains negligible (<1% CPU
impact in current measurements) and that throughput improves when path
latencies diverge.

For example, with NVMe/TCP using two paths (one throttled with an added
~30 ms delay), fio results for random read, write, and mixed read/write
workloads (direct I/O) showed:

        numa         round-robin   queue-depth  adaptive
        -----------  -----------   -----------  ---------
READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
        W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s


Would that context and data help clarify the motivation?

Furthermore, the goal isn’t to micromanage path scheduling in real time, but
to let the host bias I/O slightly toward faster or healthier paths without
violating NVMe’s design principles of simplicity and scalability. If this still
feels too policy-heavy for the block layer, I’m happy to consider ways to
simplify or confine it so that it stays aligned with NVMe’s overall design
philosophy.

Thanks,
--Nilay
