[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Tue Dec 23 06:50:32 PST 2025
[...]
>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>> have much lower amortized latency per 512 block, which could create a false bias
>>> to place a high weight on a path, if that path happened to host large I/Os, no?
>>>
>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>
> Although technically we are then measuring two different things (IO latency vs block latency). But yeah, block latency might be better
> suited for the normal case; I do wonder, though, if for high-speed
> links we do see a difference as the data transfer time is getting
> really fast...
>
For a high-speed/high-bandwidth NIC the transfer time itself would be very short;
however, for very large I/O sizes I think we would still see higher latency due
to TCP segmentation and re-assembly.
On my nvmf-tcp testbed, I do see the latency differences shown below for
varying I/O sizes (captured for a random-read, direct-I/O workload):
I/O size        Avg latency (usec)
--------        ------------------
512B                         12113
1K                           10058
2K                           11246
4K                           12458
8K                           12189
16K                          11617
32K                          17686
64K                          28504
128K                         59013
256K                        118984
512K                        233428
1M                          460000
As can be seen, for smaller block sizes (512B–16K) latency remains relatively
stable in the ~10–12 ms range. Starting at 32K, and more noticeably at 64K and
above, latency increases significantly and roughly doubles with each step in
block size. (This also illustrates the bias concern above: normalizing to 512B
sectors would credit a 1M I/O with roughly 460000/2048 ≈ 225 usec per sector,
versus ~12 ms for an actual 512B I/O.) Based on this data, I propose using
coarse-grained I/O size buckets to preserve the latency characteristics while
avoiding excessive fragmentation of the statistics. The suggested bucket layout
is as follows:
Bucket          Block-size range
------          ----------------
small           512B - <32K
medium          32K  - <64K
large-64k       64K  - <128K
large-128k      128K - <256K
large-256k      256K - <512K
large-512k      512K - <1M
very-large      >= 1M
In this model,
- A single small bucket captures latency for I/O sizes where latency remains
largely uniform.
- A medium bucket captures the transition region.
- Separate large buckets preserve the rapidly increasing latency behavior
observed for larger block sizes.
- A very-large bucket handles any I/O beyond 1M.
This approach allows the adaptive policy to retain meaningful latency distinctions across
I/O size regimes while keeping the number of buckets manageable and statistically stable.
Does that make sense?
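
To make the bucketing concrete, here is a minimal sketch of how the I/O-size
to bucket mapping could look (the enum, the helper name and the use of ilog2()
are hypothetical illustrations only, not the actual patch code):

#include <linux/log2.h>
#include <linux/sizes.h>

/* Hypothetical bucket layout matching the table above. */
enum nvme_adaptive_bucket {
	NVME_BUCKET_SMALL,	/* 512B - <32K  */
	NVME_BUCKET_MEDIUM,	/* 32K  - <64K  */
	NVME_BUCKET_LARGE_64K,	/* 64K  - <128K */
	NVME_BUCKET_LARGE_128K,	/* 128K - <256K */
	NVME_BUCKET_LARGE_256K,	/* 256K - <512K */
	NVME_BUCKET_LARGE_512K,	/* 512K - <1M   */
	NVME_BUCKET_VERY_LARGE,	/* >= 1M        */
	NVME_BUCKET_COUNT,
};

static enum nvme_adaptive_bucket nvme_adaptive_io_bucket(unsigned int io_bytes)
{
	if (io_bytes < SZ_32K)
		return NVME_BUCKET_SMALL;
	if (io_bytes < SZ_64K)
		return NVME_BUCKET_MEDIUM;
	if (io_bytes >= SZ_1M)
		return NVME_BUCKET_VERY_LARGE;
	/* 64K..<1M: one bucket per power-of-two step. */
	return NVME_BUCKET_LARGE_64K + (ilog2(io_bytes) - ilog2(SZ_64K));
}

The mapping needs no per-request state and only a couple of branches, so it
should be cheap enough to run on every completion.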
> [ .. ]
>>>> I understand your concern about whether it really makes sense to keep this
>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>> stat per-hctx instead of per-CPU.
>>>>
>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>> latency characteristics.
>>>
>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>> then accessing these weights in the fast-path is still cheap enough?
>>
>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>> scope of what we are trying to measure, as it would largely exclude components of
>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>> actual I/O cost observed by the workload, which includes not only path and controller
>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>> preserving a true end-to-end view of path latency, agreed?
>>
> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
> But two of these paths will always be on the same NUMA node).
> So that doesn't work out.
>
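To keep the per-CPU proposal concrete as well, here is a rough sketch of what
the per-CPU, per-bucket bookkeeping and the fast-path lookup could look like.
It reuses the hypothetical bucket helper sketched above; the structure layout,
the names and the issue_cpu parameter are illustrative only, not the actual
patch code:

#include <linux/percpu.h>
#include <linux/limits.h>

/* Hypothetical per-CPU latency accumulator attached to each path. */
struct nvme_adaptive_stat {
	u64	ewma_lat_ns[NVME_BUCKET_COUNT];
	u64	nr_samples[NVME_BUCKET_COUNT];
};

struct nvme_adaptive_path {
	struct nvme_adaptive_stat __percpu *stat;	/* from alloc_percpu() */
};

/*
 * On completion: fold the observed latency into the EWMA kept for the CPU
 * that issued the request (saved at submission time), so the numbers reflect
 * the end-to-end cost as seen from that CPU.
 */
static void nvme_adaptive_account(struct nvme_adaptive_path *p, int issue_cpu,
				  unsigned int io_bytes, u64 lat_ns)
{
	struct nvme_adaptive_stat *s = per_cpu_ptr(p->stat, issue_cpu);
	int b = nvme_adaptive_io_bucket(io_bytes);

	if (s->nr_samples[b]++)
		s->ewma_lat_ns[b] = s->ewma_lat_ns[b] -
				    (s->ewma_lat_ns[b] >> 3) + (lat_ns >> 3);
	else
		s->ewma_lat_ns[b] = lat_ns;	/* seed with the first sample */
}

/*
 * On submission: the issuing CPU consults only its own view of each path,
 * so the fast path reads one per-CPU cacheline per path and nothing else.
 */
static struct nvme_adaptive_path *
nvme_adaptive_select(struct nvme_adaptive_path **paths, int npaths,
		     unsigned int io_bytes)
{
	int b = nvme_adaptive_io_bucket(io_bytes);
	struct nvme_adaptive_path *best = NULL;
	u64 best_lat = U64_MAX;
	int i, cpu = get_cpu();

	for (i = 0; i < npaths; i++) {
		u64 lat = per_cpu_ptr(paths[i]->stat, cpu)->ewma_lat_ns[b];

		if (!lat) {		/* unsampled path: probe it first */
			best = paths[i];
			break;
		}
		if (lat < best_lat) {
			best_lat = lat;
			best = paths[i];
		}
	}
	put_cpu();
	return best;
}

The intent is that the per-CPU weights stay cheap to consult in the fast path
while naturally reflecting NUMA locality and scheduling effects as seen by the
issuing CPU.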
>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>> host system has 32 CPUs, so iperf3 was configured with 32 parallel TCP streams.
>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>> ioengine=io_uring. Below are the aggregated throughput results observed under
>> different NVMe multipath I/O policies:
>>
>>             numa           round-robin    queue-depth    adaptive
>>             -----------    -----------    -----------    -----------
>> READ:       61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
>> WRITE:      95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
>> RW:         R: 29.8 MiB/s  R: 53.1 MiB/s  R: 58.8 MiB/s  R: 66.6 MiB/s
>>             W: 29.6 MiB/s  W: 52.7 MiB/s  W: 58.2 MiB/s  W: 65.9 MiB/s
>>
>> These results show that under combined CPU and network stress, the adaptive I/O policy
>> consistently delivers higher throughput across read, write, and mixed workloads when
>> compared against the existing policies.
>>
> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
> Talk to me about FPIN ...
>
I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 CPUs, so fio
was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
Below are the aggregated throughput results observed under the different NVMe multipath
I/O policies.
i) All 32 CPUs stressed using stress-ng:

# stress-ng --cpu 0 --cpu-method all -t 60m

            numa           round-robin    queue-depth    adaptive
            -----------    -----------    -----------    -----------
READ:       159 MiB/s      193 MiB/s      215 MiB/s      255 MiB/s
WRITE:      188 MiB/s      186 MiB/s      195 MiB/s      199 MiB/s
RW:         R: 83.4 MiB/s  R: 101 MiB/s   R: 104 MiB/s   R: 111 MiB/s
            W: 83.3 MiB/s  W: 101 MiB/s   W: 105 MiB/s   W: 112 MiB/s
ii) Symmetric paths (no CPU stress and no induced network load):

            numa           round-robin    queue-depth    adaptive
            -----------    -----------    -----------    -----------
READ:       171 MiB/s      298 MiB/s      320 MiB/s      348 MiB/s
WRITE:      229 MiB/s      419 MiB/s      442 MiB/s      460 MiB/s
RW:         R: 93.0 MiB/s  R: 166 MiB/s   R: 171 MiB/s   R: 179 MiB/s
            W: 94.2 MiB/s  W: 168 MiB/s   W: 168 MiB/s   W: 178 MiB/s
These results show that the adaptive I/O policy consistently delivers higher
throughput under CPU stress and asymmetric path conditions, while in the
symmetric-path case it achieves throughput comparable to, or slightly better
than, the existing policies.
Thanks,
--Nilay