[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Mon Feb 2 05:33:49 PST 2026
On 1/6/26 7:46 PM, Nilay Shroff wrote:
>
>
> On 1/5/26 2:36 AM, Sagi Grimberg wrote:
>>
>>
>> On 04/01/2026 11:07, Nilay Shroff wrote:
>>>
>>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>>
>>>>> Job file:
>>>>> =========
>>>>>
>>>>> [global]
>>>>> time_based
>>>>> runtime=120
>>>>> group_reporting=1
>>>>>
>>>>> [cpu]
>>>>> ioengine=cpuio
>>>>> cpuload=85
>>>>> cpumode=qsort
>>>>> numjobs=32
>>>>>
>>>>> [disk]
>>>>> ioengine=io_uring
>>>>> filename=/dev/nvme1n2
>>>>> rw=<randread/randwrite/randrw>
>>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>>> iodepth=32
>>>>> numjobs=32
>>>>> direct=1
>>>>>
>>>>> Throughput:
>>>>> ===========
>>>>>
>>>>>        numa           round-robin    queue-depth    adaptive
>>>>>        -----------    -----------    -----------    ---------
>>>>> READ:  1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>>>>> WRITE: 1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>>>>> RW:    R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>>>>        W:999 MiB/s    W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>>>>
>>>>> When comparing the results, I did not observe a significant throughput
>>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>>> out the varying latency values and distribute I/O reasonably evenly
>>>>> across the active paths (assuming symmetric paths).
>>>>>
>>>>> Next I'd implement I/O size buckets and also per-NUMA-node weights, and
>>>>> then rerun the tests and share the results. Let's see if these changes
>>>>> help further improve the throughput numbers for the adaptive policy. We
>>>>> may then review the results again and discuss further.
>>>>>
>>>>> Thanks,
>>>>> --Nilay
>>>> two comments:
>>>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>>>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>>>> the datapath does not introduce serialization).
>>> Thanks for the suggestions. I ran experiments incorporating both points—
>>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>>> weight calculation—using the following setup.
>>>
>>> Job file:
>>> =========
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n1
>>> rw=<randread/randwrite/randrw>
>>> bssplit=<based-on-I/O-pattern-type>[1]
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>> ==========
>>>
>>> [1] Block-size distributions:
>>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>>
>>> Results:
>>> =======
>>>
>>> i) Symmetric paths + system load
>>> (CPU stress using cpuload):
>>>
>>>          per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
>>>          (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
>>>          -------    ------------------    --------    -------------------
>>> READ:    636        621                   613         618
>>> WRITE:   1832       1847                  1840        1852
>>> RW:      R:872      R:869                 R:866       R:874
>>>          W:872      W:870                 W:867       W:876
>>>
>>> ii) Asymmetric paths + system load
>>> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>>
>>>          per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
>>>          (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
>>>          -------    ------------------    --------    -------------------
>>> READ:    553        543                   540         533
>>> WRITE:   1705       1670                  1710        1655
>>> RW:      R:769      R:771                 R:784       R:772
>>>          W:768      W:767                 W:785       W:771
>>>
>>>
>>> Looking at the above results,
>>> - Per-CPU vs per-CPU with I/O buckets:
>>> The per-CPU implementation already averages latency effectively across CPUs.
>>> Introducing per-CPU I/O buckets does not provide a meaningful throughput
>>> improvement and remains largely comparable.
>>>
>>> - Per-CPU vs per-NUMA aggregation:
>>> Calculating or averaging weights at the NUMA level does not significantly
>>> improve throughput over per-CPU weight calculation. Across both symmetric
>>> and asymmetric scenarios, the results remain very close.
>>>
>>> So now, based on the above results and assessment, unless there are
>>> additional scenarios or metrics of interest, shall we proceed with per-CPU
>>> weight calculation for this new I/O policy?
>>
>> I think it is counter intuitive that bucketing I/O sizes does not present any advantage. Don't you?
>> Maybe the test is not good enough of a representation...
>>
> Hmm, you were correct; I also thought the same, but I couldn't find any
> test which could prove the advantage of using I/O buckets. So today I
> spent some time thinking about scenarios which could demonstrate the worth
> of I/O buckets, and I came up with the following use case.
>
> Size-dependent path behavior:
>
> 1. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
> Now running mixed I/O (bssplit => 16k/75:64k/25),
>
> Without buckets:
> Path B looks good; scheduler forwards more I/Os towards path B.
>
> With buckets:
> small I/Os are distributed across path A and B
> large I/Os favor path B
>
> So in theory, throughput should improve with buckets.
>
> 2. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: opposite
>
> Without buckets:
> latency averages cancel out
> scheduler sees “paths are equal”
>
> With buckets:
> small I/O bucket favors A
> large I/O bucket favors B
>
> Again, in theory, throughput should improve with buckets.
>
> So with the above thought, I ran another experiment; the results are
> shown below:
>
> I injected additional delay on one path for larger I/Os (>=32k) and
> mixed I/O sizes with bssplit => 16k/75:64k/25. So with this test, we
> have:
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
>          per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
>          (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
>          -------    ------------------    --------    -------------------
> READ:    550        622                   523         615
> WRITE:   726        829                   747         834
> RW:      R:324      R:381                 R:306       R:375
>          W:323      W:381                 W:306       W:374
>
> So yes, I/O buckets could be useful for the scenario tested above. And
> regarding per-CPU vs per-NUMA weight calculation, do you agree that
> per-CPU should be good enough for this policy, given that, as seen above,
> per-NUMA doesn't improve performance much?
>
>
>> Let's also test what happens with multiple clients against the same subsystem.
> Yes, this is a good test to run; I will test and post the results.
>
Finally, I was able to run tests with two nvmf-tcp hosts connected
to the same nvmf-tcp target. Apologies for the delay; setting up this
topology took some time, partly due to recent non-technical infrastructure
challenges after our lab relocation.

The goal of these tests was to evaluate per-CPU vs per-NUMA weight
calculation, with and without I/O size buckets, under multi-client
contention. As in my earlier tests, I ran randread, randwrite and randrw
workloads with mixed I/O sizes (using bssplit) and induced CPU stress on
the hosts (using cpuload). Please find the test results and observations
below.

Workload characteristics:
=========================
- Workloads tested: randread, randwrite, randrw
- Mixed I/O sizes using bssplit
- CPU stress induced using cpuload
- Both hosts run workloads simultaneously

Job file:
=========
[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
ramp_time=120

[1] Block-size distributions:
randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5

Test topology:
==============
1. Two nvmf-tcp hosts connected to the same nvmf-tcp target
2. Each host connects to the target over two symmetric paths
3. System load on each host is induced using cpuload (as shown in the job file)
4. Both hosts run I/O workloads concurrently

Results:
========

Host1:
         per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
         (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
         -------    ------------------    --------    -------------------
READ:    153        164                   166         131
WRITE:   839        837                   889         839
RW:      R:249      R:255                 R:226       R:256
         W:247      W:254                 W:225       W:253

Host2:
         per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
         (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
         -------    ------------------    --------    -------------------
READ:    268        258                   279         268
WRITE:   1012       992                   880         1017
RW:      R:386      R:410                 R:401       R:405
         W:385      W:409                 W:399       W:405

From the above results, I get the same impression as from my earlier
single-host tests against the same target.

Per-CPU vs per-CPU with I/O buckets:
- The per-CPU implementation already averages latency effectively across CPUs.
- Introducing per-CPU I/O buckets does not provide a meaningful throughput
  improvement in the general case.
- Results remain largely comparable across workloads and hosts.
- However, as shown in the earlier experiment with I/O size-dependent path
  behavior, I/O buckets can provide measurable benefits in specific
  scenarios (see the rough sketch below).
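
To make that concrete, below is a small standalone C sketch of the kind of
selection logic I have in mind (hypothetical names and bucket boundaries,
not the actual patch code): each path keeps one smoothed latency score per
I/O size bucket, and the scheduler consults only the score of the bucket
the incoming I/O falls into, so a path that is slow only for large I/Os
still gets its fair share of small I/Os. The real policy would distribute
I/O in proportion to per-path weights rather than always picking the
minimum, but the per-bucket lookup is the point here.

/* Illustrative userspace sketch only -- hypothetical names, not the patch code. */
#include <stdio.h>

#define NR_BUCKETS	3	/* e.g. <=16k, <=64k, >64k */

struct path_score {
	const char *name;
	/* smoothed completion latency (us), one slot per I/O size bucket */
	unsigned long lat_ewma[NR_BUCKETS];
};

static int io_size_to_bucket(unsigned int len)
{
	if (len <= 16 * 1024)
		return 0;
	if (len <= 64 * 1024)
		return 1;
	return 2;
}

/* With buckets: compare only the score of the bucket this I/O falls into. */
static struct path_score *select_path(struct path_score *paths, int nr,
				      unsigned int len)
{
	int i, b = io_size_to_bucket(len);
	struct path_score *best = &paths[0];

	for (i = 1; i < nr; i++)
		if (paths[i].lat_ewma[b] < best->lat_ewma[b])
			best = &paths[i];
	return best;
}

int main(void)
{
	/* Path A: good for <=16k, bad for >=32k; Path B: good for all */
	struct path_score paths[] = {
		{ "A", { 100, 900, 1500 } },
		{ "B", { 120, 180,  250 } },
	};

	printf("16k I/O -> path %s\n", select_path(paths, 2, 16 * 1024)->name);
	printf("64k I/O -> path %s\n", select_path(paths, 2, 64 * 1024)->name);
	return 0;
}

Without buckets, path A's good small-I/O latency and bad large-I/O latency
would blend into a single score and the scheduler would largely avoid A,
which is exactly the effect seen in the size-dependent path experiment above.
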
Per-CPU vs per-NUMA aggregation:
- Calculating or averaging weights at the NUMA level does not significantly improve
throughput over per-CPU weight calculation.
- This holds true even under multi-host contention.

Based on all the tests conducted so far, covering symmetric and asymmetric
paths, CPU stress, size-dependent path behavior, and multi-client access to
the same target, the results suggest that we should move forward with a
per-CPU implementation using I/O buckets. That said, I am open to any further
feedback, suggestions, or additional scenarios that might be worth evaluating.
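
For the per-CPU side, the direction I am thinking of is roughly the
following (again a simplified, standalone sketch with hypothetical names,
not the actual patch code): each path keeps per-CPU, per-bucket latency
state, which in the driver would be allocated with alloc_percpu() so that
the completion path only touches CPU-local data and introduces no
cross-CPU serialization; the per-bucket latency is smoothed with a simple
EWMA and the forwarding weight is then derived from it.

/* Simplified standalone sketch -- hypothetical names, not the actual patch
 * code. In the driver this state would be allocated per path with
 * alloc_percpu(), keeping completion-path updates CPU-local.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_BUCKETS	3	/* e.g. <=16k, <=64k, >64k */

struct adaptive_stat {
	uint64_t lat_ewma[NR_BUCKETS];	/* smoothed completion latency (us) */
};

/* EWMA with weight 1/8: new = old - old/8 + sample/8 */
static void adaptive_update(struct adaptive_stat *st, int bucket, uint64_t lat_us)
{
	uint64_t old = st->lat_ewma[bucket];

	st->lat_ewma[bucket] = old - (old >> 3) + (lat_us >> 3);
}

int main(void)
{
	struct adaptive_stat st = { .lat_ewma = { 100, 400, 1200 } };
	int i;

	/* a burst of slow 64k completions ages only the middle bucket */
	for (i = 0; i < 16; i++)
		adaptive_update(&st, 1, 2000);

	printf("<=16k: %llu us, <=64k: %llu us, >64k: %llu us\n",
	       (unsigned long long)st.lat_ewma[0],
	       (unsigned long long)st.lat_ewma[1],
	       (unsigned long long)st.lat_ewma[2]);
	return 0;
}

If it turns out to be useful later, the per-CPU values could still be
periodically averaged per NUMA node and written back to the per-CPU copies,
along the lines Sagi suggested, but the numbers above suggest that is not
needed for now.
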
Thanks,
--Nilay