[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy

Nilay Shroff nilay at linux.ibm.com
Sun Jan 4 01:07:48 PST 2026



On 12/27/25 3:07 PM, Sagi Grimberg wrote:
> 
>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>> file I used for the test, followed by the observed throughput results for reference.
>>
>> Job file:
>> =========
>>
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> cpumode=qsort
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n2
>> rw=<randread/randwrite/randrw>
>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>> iodepth=32
>> numjobs=32
>> direct=1
>>
>> Throughput:
>> ===========
>>
>>          numa           round-robin    queue-depth    adaptive
>>          ------------   ------------   ------------   ------------
>> READ:    1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>> WRITE:   1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>> RW:      R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>          W:999  MiB/s   W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>
>> When comparing the results, I did not observe a significant throughput
>> difference between the queue-depth, round-robin, and adaptive policies.
>> With random I/O of mixed sizes, the adaptive policy appears to average
>> out the varying latency values and distribute I/O reasonably evenly
>> across the active paths (assuming symmetric paths).
>>
>> Next, I'll implement I/O size buckets as well as per-NUMA-node weights,
>> rerun the tests, and share the results. Let's see whether these changes
>> further improve the throughput numbers for the adaptive policy. We can
>> then review the results and discuss further.
>>
>> Thanks,
>> --Nilay
> 
> two comments:
> 1. I'd make the read split slightly biased towards small block sizes, and the write split biased towards larger block sizes
> 2. I'd also suggest measuring with the weight calculation averaged over all numa-node cores and then set per-CPU (such that
> the datapath does not introduce serialization).

Thanks for the suggestions. I ran experiments incorporating both points,
biasing I/O sizes by operation type and comparing per-CPU against per-NUMA
weight calculation, using the following setup.

Job file:
=========
[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
=========

[1] Block-size distributions:
    randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
    randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
    randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
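
For reference, these splits imply a mean block size of about 8.3 KiB for
reads versus 154 KiB for writes (roughly 41 KiB for the mixed case), so
they do carry the small-read/large-write bias you suggested. A trivial
userspace check of that arithmetic (illustrative only, not part of the
patch):

#include <stdio.h>

/* Weighted mean block size implied by a bssplit (percentages, bytes). */
static double mean_bs_kib(const int pct[], const int bs[], int n)
{
	double sum = 0;

	for (int i = 0; i < n; i++)
		sum += pct[i] / 100.0 * bs[i];
	return sum / 1024;
}

int main(void)
{
	const int rd_pct[] = { 30, 25, 20, 15, 10 };
	const int rd_bs[]  = { 512, 4096, 8192, 16384, 32768 };
	const int wr_pct[] = { 10, 20, 30, 40 };
	const int wr_bs[]  = { 4096, 65536, 131072, 262144 };

	printf("randread mean:  %.1f KiB\n", mean_bs_kib(rd_pct, rd_bs, 5));
	printf("randwrite mean: %.1f KiB\n", mean_bs_kib(wr_pct, wr_bs, 4));
	return 0;
}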

Results:
========

i) Symmetric paths + system load
   (CPU stress using cpuload):

         per-CPU    per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
         (MiB/s)    (MiB/s)              (MiB/s)    (MiB/s)
         --------   ------------------   --------   -------------------
READ:    636        621                  613        618
WRITE:   1832       1847                 1840       1852
RW:      R:872      R:869                R:866      R:874
         W:872      W:870                W:867      W:876

ii) Asymmetric paths + system load
   (CPU stress using cpuload, plus iperf3 traffic to induce network congestion):

         per-CPU    per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
         (MiB/s)    (MiB/s)              (MiB/s)    (MiB/s)
         --------   ------------------   --------   -------------------
READ:    553        543                  540        533
WRITE:   1705       1670                 1710       1655
RW:      R:769      R:771                R:784      R:772
         W:768      W:767                W:785      W:771


Looking at the above results:
- Per-CPU vs per-CPU with I/O-size buckets:
  The per-CPU implementation already averages latency effectively across
  CPUs. Adding I/O-size buckets provides no meaningful throughput
  improvement; the numbers remain largely comparable (see the first
  sketch below).

- Per-CPU vs per-NUMA aggregation:
  Averaging the weights at the NUMA level and publishing them per-CPU
  (see the second sketch below) does not significantly improve throughput
  over plain per-CPU calculation; across both the symmetric and
  asymmetric scenarios the results remain very close.
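
For reference, here is a minimal sketch of the per-CPU I/O-size-bucket
bookkeeping behind the "per-CPU-IO-buckets" columns above. The names,
bucket boundaries, and 1/8 EWMA smoothing are illustrative assumptions,
not the actual patch code:

#include <linux/percpu.h>
#include <linux/sizes.h>
#include <linux/types.h>

/* Hypothetical per-path state, allocated with alloc_percpu(). */
enum { ADP_BKT_SMALL, ADP_BKT_MEDIUM, ADP_BKT_LARGE, ADP_NR_BKTS };

struct adp_path_stat {
	u64 ewma_lat[ADP_NR_BKTS];	/* smoothed latency per size bucket */
};

static int adp_bucket(unsigned int bytes)
{
	if (bytes <= SZ_16K)
		return ADP_BKT_SMALL;
	if (bytes <= SZ_128K)
		return ADP_BKT_MEDIUM;
	return ADP_BKT_LARGE;
}

/* On I/O completion: fold the latency sample into the matching bucket. */
static void adp_account(struct adp_path_stat __percpu *stat,
			unsigned int bytes, u64 lat_ns)
{
	struct adp_path_stat *s = this_cpu_ptr(stat);
	int b = adp_bucket(bytes);

	/* EWMA with 1/8 smoothing, as in similar kernel estimators. */
	s->ewma_lat[b] -= s->ewma_lat[b] >> 3;
	s->ewma_lat[b] += lat_ns >> 3;
}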
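
And the per-NUMA variant measured above, along the lines of your second
suggestion: a periodic worker (off the datapath) averages the per-CPU
estimates across each node's CPUs and publishes the result back per-CPU,
so the submission path never reads remote state. Again purely
illustrative, with the latency-to-weight conversion elided:

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/topology.h>
#include <linux/types.h>

/* Average one path's per-CPU latency estimate over a NUMA node and
 * publish it back to every CPU in that node, keeping reads CPU-local. */
static void adp_update_node_weight(u64 __percpu *ewma_lat,
				   u64 __percpu *weight, int node)
{
	u64 sum = 0;
	unsigned int cpu, nr = 0;

	for_each_cpu(cpu, cpumask_of_node(node)) {
		sum += *per_cpu_ptr(ewma_lat, cpu);
		nr++;
	}
	if (!nr)
		return;

	for_each_cpu(cpu, cpumask_of_node(node))
		*per_cpu_ptr(weight, cpu) = sum / nr;
}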

Based on the above results and assessment, unless there are additional
scenarios or metrics of interest, shall we proceed with per-CPU weight
calculation for this new I/O policy?
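
For concreteness, with per-CPU weights the submission path would look
roughly like the sketch below (notionally in drivers/nvme/host/multipath.c):
the submitting CPU compares only its local values, so the hot path touches
no shared state. The per-CPU 'adp_weight' field on struct nvme_ns is an
assumed addition, named here only for illustration; higher weight means a
currently faster path:

/* Assumes the patch adds 'u64 __percpu *adp_weight' to struct nvme_ns. */
static struct nvme_ns *adp_best_path(struct nvme_ns_head *head)
{
	struct nvme_ns *ns, *best = NULL;
	u64 w, best_w = 0;

	list_for_each_entry_srcu(ns, &head->list, siblings,
				 srcu_read_lock_held(&head->srcu)) {
		if (nvme_path_is_disabled(ns))
			continue;
		/* Purely CPU-local read; no cross-CPU serialization. */
		w = this_cpu_read(*ns->adp_weight);
		if (w > best_w) {
			best_w = w;
			best = ns;
		}
	}
	return best;
}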

Thanks,
--Nilay


