[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Fri Dec 26 10:16:08 PST 2025
On 12/25/25 6:15 PM, Sagi Grimberg wrote:
>
>
> On 23/12/2025 16:50, Nilay Shroff wrote:
>> [...]
>>>>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that large I/O will
>>>>> have a much lower amortized latency per 512-byte block, which could create a false bias
>>>>> toward placing a high weight on a path if that path happened to host large I/Os, no?
>>>>>
>>>> Hmm, yes, good point; I think for NVMe over fabrics this could be true.
>>>>
>>> Although technically we are then measuring two different things (I/O latency vs block latency). But yeah, block latency might be better
>>> suited for the normal case; I do wonder, though, whether for high-speed
>>> links we would see a difference, as the data transfer time is getting
>>> really fast...
>>>
>> For a high-speed/high-bandwidth NIC the transfer time would be very short,
>> though I think for very large I/O sizes we would see higher latency due
>> to TCP segmentation and re-assembly.
>>
>> On my nvmf-tcp testbed, I do see the latency differences as shown below
>> for varying I/O size (captured for random-read direct I/O workload):
>> I/O-size Avg-latency(usec)
>> 512 12113
>> 1k 10058
>> 2k 11246
>> 4k 12458
>> 8k 12189
>> 16k 11617
>> 32k 17686
>> 64k 28504
>> 128k 59013
>> 256k 118984
>> 512k 233428
>> 1M 460000
>>
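(As a quick check on the per-512B-block bias raised above, normalizing the two
extremes of this table: a 4k I/O at ~12458 usec works out to 12458/8 ≈ 1557 usec
per 512B block, while a 1M I/O at ~460000 usec is only 460000/2048 ≈ 225 usec per
512B block. So a path that happens to carry mostly large I/O would indeed look
artificially cheap on a per-block basis.)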
>> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
>> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
>> above, latency increases significantly and roughly doubles with each step in
>> block size. Based on this data, I propose using coarse-grained I/O size buckets
>> to preserve latency characteristics while avoiding excessive fragmentation of
>> statistics. The suggested bucket layout is as follows:
>>
>> Bucket block-size-range
>> small 512B-32k
>> medium 32k-64k
>> large-64k 64k-128k
>> large-128k 128k-256k
>> large-256k 256k-512k
>> large-512k 512k-1M
>> very-large >=1M
>>
>> In this model,
>> - A single small bucket captures latency for I/O sizes where latency remains
>> largely uniform.
>> - A medium bucket captures the transition region.
>> - Separate large buckets preserve the rapidly increasing latency behavior
>> observed for larger block sizes.
>> - A very-large bucket handles any I/O beyond 1M.
>>
>> This approach allows the adaptive policy to retain meaningful latency distinctions across
>> I/O size regimes while keeping the number of buckets manageable and statistically stable.
>> Does that make sense?
>
> Yes
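For concreteness, below is a rough sketch of the bucket mapping I have in mind. This
is only an illustration of the boundaries proposed above; the enum and function names
are placeholders and not taken from the actual patch (SZ_* constants come from
<linux/sizes.h>):

enum nvme_adp_bucket {
        NVME_ADP_BUCKET_SMALL,          /* [512B, 32k)  */
        NVME_ADP_BUCKET_MEDIUM,         /* [32k, 64k)   */
        NVME_ADP_BUCKET_LARGE_64K,      /* [64k, 128k)  */
        NVME_ADP_BUCKET_LARGE_128K,     /* [128k, 256k) */
        NVME_ADP_BUCKET_LARGE_256K,     /* [256k, 512k) */
        NVME_ADP_BUCKET_LARGE_512K,     /* [512k, 1M)   */
        NVME_ADP_BUCKET_VERY_LARGE,     /* >= 1M        */
        NVME_ADP_NR_BUCKETS,
};

/* Map an I/O size in bytes to the coarse latency bucket it belongs to. */
static inline enum nvme_adp_bucket nvme_adp_io_bucket(unsigned int len)
{
        if (len < SZ_32K)
                return NVME_ADP_BUCKET_SMALL;
        if (len < SZ_64K)
                return NVME_ADP_BUCKET_MEDIUM;
        if (len < SZ_128K)
                return NVME_ADP_BUCKET_LARGE_64K;
        if (len < SZ_256K)
                return NVME_ADP_BUCKET_LARGE_128K;
        if (len < SZ_512K)
                return NVME_ADP_BUCKET_LARGE_256K;
        if (len < SZ_1M)
                return NVME_ADP_BUCKET_LARGE_512K;
        return NVME_ADP_BUCKET_VERY_LARGE;
}

The ranges are half-open, so an exactly 32k I/O lands in the medium bucket and an
exactly 64k I/O in the first large bucket.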
>
>>
>>> [ .. ]
>>>>>> I understand your concern about whether it really makes sense to keep this
>>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>>> stat per-hctx instead of per-CPU.
>>>>>>
>>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>>> latency characteristics.
>>>>> With this I tend to agree, but per-cpu has a lot of other churn IMO.
>>>>> Maybe the answer is that path weights are maintained per NUMA node?
>>>>> Then accessing these weights in the fast path is still cheap enough?
>>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>>> scope of what we are trying to measure, as it would largely exclude components of
>>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>>> actual I/O cost observed by the workload, which includes not only path and controller
>>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>>> preserving a true end-to-end view of path latency, agreed?
>>>>
>>> Well, for fabrics you can easily have several paths connected to the same NUMA node (like the classical 'two initiator ports cross-connected to two target ports' setup, resulting in four paths in total, where two of those paths will always be on the same NUMA node).
>>> So that doesn't work out.
>>>
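For reference, here is a minimal sketch of what the per-CPU, per-path accounting
could look like. This is illustrative only and not lifted from the patch; the
structure and function names are placeholders, and it reuses the nvme_adp_io_bucket()
helper sketched above (kernel types/helpers from <linux/types.h> and <linux/percpu.h>):

struct nvme_adp_bucket_stat {
        u64     nr_samples;     /* completed I/Os in this bucket */
        u64     total_lat_ns;   /* sum of their completion latencies */
};

struct nvme_adp_path_stat {
        struct nvme_adp_bucket_stat bucket[NVME_ADP_NR_BUCKETS];
};

/*
 * Called on I/O completion; 'stat' would be a per-path
 * "struct nvme_adp_path_stat __percpu *" allocated with alloc_percpu().
 * A real implementation would need to consider the completion context
 * (e.g. use get_cpu_ptr()/put_cpu_ptr() if it can run preemptibly).
 */
static void nvme_adp_account_io(struct nvme_adp_path_stat __percpu *stat,
                                unsigned int len, u64 lat_ns)
{
        struct nvme_adp_path_stat *s = this_cpu_ptr(stat);
        enum nvme_adp_bucket b = nvme_adp_io_bucket(len);

        s->bucket[b].nr_samples++;
        s->bucket[b].total_lat_ns += lat_ns;
}

Keeping the samples split by bucket means the weight calculation can later compare
like-sized I/O across paths instead of normalizing everything to 512-byte blocks.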
>>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>>> host system has 32 CPUs, so iperf3 was configured with 32 parallel TCP streams.
>>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>>> different NVMe multipath I/O policies:
>>>>
>>>> numa round-robin queue-depth adaptive
>>>> ----------- ----------- ----------- ---------
>>>> READ: 61.1 MiB/s 87.2 MiB/s 93.1 MiB/s 107 MiB/s
>>>> WRITE: 95.8 MiB/s 138 MiB/s 159 MiB/s 179 MiB/s
>>>> RW: R:29.8 MiB/s R:53.1 MiB/s R:58.8 MiB/s R:66.6 MiB/s
>>>> W:29.6 MiB/s W:52.7 MiB/s W:58.2 MiB/s W:65.9 MiB/s
>>>>
>>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>>> compared against the existing policies.
>>>>
>>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
>>> Talk to me about FPIN ...
>>>
>> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 CPUs, so fio
>> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
>> Below are the aggregated throughput results observed under different NVMe multipath
>> I/O policies.
>>
>> i) Stressing all 32 CPUs using stress-ng
>>
>> All 32 CPUs were stressed using:
>> # stress-ng --cpu 0 --cpu-method all -t 60m
>>
>> numa round-robin queue-depth adaptive
>> ----------- ----------- ----------- ---------
>> READ: 159 MiB/s 193 MiB/s 215 MiB/s 255 MiB/s
>> WRITE: 188 MiB/s 186 MiB/s 195 MiB/s 199 MiB/s
>> RW: R:83.4 MiB/s R:101 MiB/s R:104 MiB/s R: 111 MiB/s
>> W:83.3 MiB/s W:101 MiB/s W:105 MiB/s W: 112 MiB/s
>>
>> ii) Symmetric paths (No CPU stress and no induced network load):
>>
>> numa round-robin queue-depth adaptive
>> ----------- ----------- ----------- ---------
>> READ: 171 MiB/s 298 MiB/s 320 MiB/s 348 MiB/s
>> WRITE: 229 MiB/s 419 MiB/s 442 MiB/s 460 MiB/s
>> RW: R: 93.0 MiB/s R: 166 MiB/s R: 171 MiB/s R: 179 MiB/s
>> W: 94.2 MiB/s W: 168 MiB/s W: 168 MiB/s W: 178 MiB/s
>>
>> These results show that the adaptive I/O policy consistently delivers higher
>> throughput under CPU stress and asymmetric path conditions. In the case of
>> symmetric paths, the adaptive policy achieves throughput comparable to, or
>> slightly better than, the existing policies.
>
> I still think that accounting uncorrelated latency is the best approach here.
>
> My intuition tells me that:
> 1. averaging latencies over the NUMA node,
> 2. calculating weights, and
> 3. distributing the new weights per-cpu within the NUMA node
>
> is a better approach. It is hard to evaluate without adding some randomness.
>
> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
file I used for the test (the rw= line was set to randread, randwrite, or randrw for
each run), followed by the observed throughput results for reference.
Job file:
=========
[global]
time_based
runtime=120
group_reporting=1
[cpu]
ioengine=cpuio
cpuload=85
cpumode=qsort
numjobs=32
[disk]
ioengine=io_uring
filename=/dev/nvme1n2
rw=<randread/randwrite/randrw>
bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
iodepth=32
numjobs=32
direct=1
Throughput:
===========
numa round-robin queue-depth adaptive
----------- ----------- ----------- ---------
READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
When comparing the results, I did not observe a significant throughput
difference between the queue-depth, round-robin, and adaptive policies.
With random I/O of mixed sizes, the adaptive policy appears to average
out the varying latency values and distribute I/O reasonably evenly
across the active paths (assuming symmetric paths).
Next I'll implement the I/O size buckets and the per-NUMA-node weights, then rerun
the tests and share the results; a rough sketch of the direction I have in mind is
included below. Let's see if these changes help further improve the throughput
numbers for the adaptive policy. We can then review the results again and discuss
further.
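The sketch only illustrates the aggregation step (per-CPU samples folded into a
per-NUMA-node average, then published as a weight). All names are hypothetical,
including the ns->adp_stat and ns->adp_node_weight fields, and the weight formula
is just a placeholder (helpers from <linux/topology.h> and <linux/math64.h>):

/*
 * Hypothetical periodic update: fold the per-CPU samples of one path
 * (namespace) into a per-NUMA-node average latency and publish a
 * weight that the submission fast path can read with a plain load.
 */
static void nvme_adp_update_node_weight(struct nvme_ns *ns, int node)
{
        u64 samples = 0, lat_ns = 0, avg;
        int cpu, b;

        for_each_cpu(cpu, cpumask_of_node(node)) {
                struct nvme_adp_path_stat *s = per_cpu_ptr(ns->adp_stat, cpu);

                /* Fold all buckets; a real version might weight them. */
                for (b = 0; b < NVME_ADP_NR_BUCKETS; b++) {
                        samples += s->bucket[b].nr_samples;
                        lat_ns += s->bucket[b].total_lat_ns;
                }
        }

        if (!samples)
                return;

        avg = div64_u64(lat_ns, samples);
        /* Placeholder: lower average latency => higher weight. */
        WRITE_ONCE(ns->adp_node_weight[node], div64_u64(NSEC_PER_SEC, avg + 1));
}

On the submission side each CPU would then just read the weight of its own node,
which keeps the fast path cheap while still smoothing out per-CPU noise, roughly
along the lines Sagi suggested above.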
Thanks,
--Nilay