[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy

Nilay Shroff nilay at linux.ibm.com
Tue Jan 6 06:16:29 PST 2026



On 1/5/26 2:36 AM, Sagi Grimberg wrote:
> 
> 
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>>           numa           round-robin    queue-depth    adaptive
>>>>           ------------   ------------   ------------   ------------
>>>> READ:     1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>>>> WRITE:    1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>>>> RW:       R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>>>           W:999  MiB/s   W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
>>>>
>>>> Next, I plan to implement I/O size buckets and also per-NUMA node
>>>> weights, then rerun the tests and share the results. Let's see if these
>>>> changes further improve the throughput numbers for the adaptive policy.
>>>> We can then review the results again and discuss further.
>>>>
>>>> Thanks,
>>>> --Nilay
>>> Two comments:
>>> 1. I'd make the read split slightly biased towards small block sizes, and the write split biased towards larger block sizes.
>>> 2. I'd also suggest measuring with the weight calculation averaged over all NUMA-node cores and then set per-CPU (such that
>>> the datapath does not introduce serialization).
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>>      randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>      randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>      randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> ========
>>
>> i) Symmetric paths + system load
>>     (CPU stress using cpuload):
>>
>>          per-CPU    per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>          (MiB/s)    (MiB/s)              (MiB/s)    (MiB/s)
>>          -------    ------------------   --------   -------------------
>> READ:    636        621                  613        618
>> WRITE:   1832       1847                 1840       1852
>> RW:      R:872      R:869                R:866      R:874
>>          W:872      W:870                W:867      W:876
>>
>> ii) Asymmetric paths + system load
>>     (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>
>>          per-CPU    per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>          (MiB/s)    (MiB/s)              (MiB/s)    (MiB/s)
>>          -------    ------------------   --------   -------------------
>> READ:    553        543                  540        533
>> WRITE:   1705       1670                 1710       1655
>> RW:      R:769      R:771                R:784      R:772
>>          W:768      W:767                W:785      W:771
>>
>>
>> Looking at the above results:
>> - Per-CPU vs per-CPU with I/O buckets:
>>    The per-CPU implementation already averages latency effectively across
>>    CPUs. Introducing per-CPU I/O buckets does not provide a meaningful
>>    throughput improvement; the results remain largely comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>>    Calculating or averaging weights at the NUMA level does not significantly
>>    improve throughput over per-CPU weight calculation. Across both symmetric
>>    and asymmetric scenarios, the results remain very close.
>>
>> So, based on the above results and assessment, unless there are additional
>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>> calculation for this new I/O policy?
> 
> I think it is counterintuitive that bucketing I/O sizes does not present any advantage. Don't you?
> Maybe the test is not a good enough representation...
> 
Hmm, you were right; I thought the same, but I couldn't find a test
that demonstrated the advantage of using I/O buckets. So today I
spent some time thinking about scenarios that could show the worth
of I/O buckets, and I came up with the following use cases.

Size-dependent path behavior (a rough sketch of the per-bucket
selection follows these examples):

1. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: good for all sizes

   Running mixed I/O (bssplit => 16k/75:64k/25):

   Without buckets:
   Path B's average latency looks better, so the scheduler forwards
   most I/Os to path B while path A stays underutilized.

   With buckets:
   Small I/Os are distributed across paths A and B;
   large I/Os favor path B.

   So, in theory, throughput should improve with buckets.

2. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: the opposite

   Without buckets:
   The latency averages cancel out, so the scheduler sees the paths
   as equal.

   With buckets:
   The small-I/O bucket favors A;
   the large-I/O bucket favors B.

   Again, in theory, throughput should improve with buckets.
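
To make that concrete, below is a rough userspace model of the
per-bucket selection I have in mind (the bucket boundaries, EWMA
weighting, and helper names are illustrative only and not taken from
the actual patch): each path keeps one EWMA latency per I/O-size
bucket, completions fold the measured latency into the bucket the I/O
fell into, and submission picks the path with the lowest EWMA for the
request's bucket.

Sketch (illustrative only):
===========================

/*
 * Userspace model of bucketed path selection; in the kernel this
 * state would be per-CPU so the datapath avoids serialization.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_BUCKETS  4   /* <=4k, <=16k, <=64k, >64k (illustrative) */
#define NR_PATHS    2
#define EWMA_SHIFT  3   /* new = old - old/8 + sample/8 */

static uint64_t ewma_lat[NR_PATHS][NR_BUCKETS];

static int bucket_idx(uint32_t io_bytes)
{
        if (io_bytes <= 4096)
                return 0;
        if (io_bytes <= 16384)
                return 1;
        if (io_bytes <= 65536)
                return 2;
        return 3;
}

/* Completion side: fold the measured latency into the bucket's EWMA. */
static void update_path_latency(int path, uint32_t io_bytes, uint64_t lat_ns)
{
        uint64_t *e = &ewma_lat[path][bucket_idx(io_bytes)];

        *e = *e - (*e >> EWMA_SHIFT) + (lat_ns >> EWMA_SHIFT);
}

/* Submission side: pick the path with the lowest EWMA for this I/O size. */
static int select_path(uint32_t io_bytes)
{
        int b = bucket_idx(io_bytes), best = 0, p;

        for (p = 1; p < NR_PATHS; p++)
                if (ewma_lat[p][b] < ewma_lat[best][b])
                        best = p;
        return best;
}

int main(void)
{
        /* Path 0: fast for <=16k, slow for >=32k; path 1: uniform. */
        update_path_latency(0, 16384, 100000);
        update_path_latency(0, 65536, 900000);
        update_path_latency(1, 16384, 150000);
        update_path_latency(1, 65536, 300000);

        printf("16k I/O -> path %d\n", select_path(16384));  /* expect 0 */
        printf("64k I/O -> path %d\n", select_path(65536));  /* expect 1 */
        return 0;
}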

With that in mind, I ran another experiment; the results are shown
below. I injected additional delay on one path for larger I/Os
(>=32k) and mixed the I/O sizes with bssplit => 16k/75:64k/25, so
for this test we have:

Path A: good for ≤16k, bad for ≥32k
Path B: good for all sizes

         per-CPU    per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
         (MiB/s)    (MiB/s)              (MiB/s)    (MiB/s)
         -------    ------------------   --------   -------------------
READ:    550        622                  523        615
WRITE:   726        829                  747        834
RW:      R:324      R:381                R:306      R:375
         W:323      W:381                W:306      W:374

So yes, I/O buckets can be useful for a scenario like the one tested
above. Regarding per-CPU vs. per-NUMA weight calculation, do you
agree that per-CPU should be good enough for this policy, given that,
as seen above, per-NUMA aggregation doesn't improve performance much?
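
For reference, this is roughly how I think about the two variants (an
illustrative userspace sketch with a made-up topology, not the actual
patch): with per-CPU weights the submitting CPU reads and updates only
its own slot, while the per-NUMA variant adds a periodic step that
folds all CPUs of a node into one shared weight for the datapath to
consult.

Sketch (illustrative only):
===========================

/*
 * Userspace model of per-CPU vs. per-NUMA weight derivation; the
 * topology and latency numbers are made up for the example.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS        4
#define NR_PATHS       2
#define CPUS_PER_NODE  2
#define NR_NODES       (NR_CPUS / CPUS_PER_NODE)

/* Per-CPU EWMA latency per path, updated locally on I/O completion. */
static uint64_t cpu_ewma[NR_CPUS][NR_PATHS];

/* per-CPU variant: the submitting CPU consults only its own slot. */
static int select_path_percpu(int cpu)
{
        int best = 0, p;

        for (p = 1; p < NR_PATHS; p++)
                if (cpu_ewma[cpu][p] < cpu_ewma[cpu][best])
                        best = p;
        return best;
}

/*
 * per-NUMA variant: a periodic pass averages all CPUs of a node into
 * one shared weight; this is the extra aggregation step (and shared
 * state) that the per-CPU variant avoids.
 */
static uint64_t node_ewma[NR_NODES][NR_PATHS];

static void refresh_node_ewma(int node)
{
        int first = node * CPUS_PER_NODE, p, c;

        for (p = 0; p < NR_PATHS; p++) {
                uint64_t sum = 0;

                for (c = first; c < first + CPUS_PER_NODE; c++)
                        sum += cpu_ewma[c][p];
                node_ewma[node][p] = sum / CPUS_PER_NODE;
        }
}

static int select_path_pernuma(int node)
{
        int best = 0, p;

        for (p = 1; p < NR_PATHS; p++)
                if (node_ewma[node][p] < node_ewma[node][best])
                        best = p;
        return best;
}

int main(void)
{
        /* CPU 0 observed path 1 as faster; CPU 1 observed path 0 as faster. */
        cpu_ewma[0][0] = 400; cpu_ewma[0][1] = 200;
        cpu_ewma[1][0] = 100; cpu_ewma[1][1] = 300;

        refresh_node_ewma(0);
        printf("cpu0, per-CPU   -> path %d\n", select_path_percpu(0));   /* 1 */
        printf("cpu1, per-CPU   -> path %d\n", select_path_percpu(1));   /* 0 */
        printf("node0, per-NUMA -> path %d\n", select_path_pernuma(0));  /* 250 vs 250 -> 0 */
        return 0;
}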


> Let's also test what happens with multiple clients against the same subsystem.
Yes, this is a good test to run; I will run it and post the results.

Thanks,
--Nilay


