[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy

Sagi Grimberg sagi at grimberg.me
Sun Jan 4 13:06:27 PST 2026



On 04/01/2026 11:07, Nilay Shroff wrote:
>
> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>> file I used for the test, followed by the observed throughput results for reference.
>>>
>>> Job file:
>>> =========
>>>
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> cpumode=qsort
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n2
>>> rw=<randread/randwrite/randrw>
>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>>
>>> Throughput:
>>> ===========
>>>
>>>          numa           round-robin    queue-depth    adaptive
>>>          ------------   ------------   ------------   ------------
>>> READ:    1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>>> WRITE:   1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>>> RW:      R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>>          W:999  MiB/s   W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>>
>>> When comparing the results, I did not observe a significant throughput
>>> difference between the queue-depth, round-robin, and adaptive policies.
>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>> out the varying latency values and distribute I/O reasonably evenly
>>> across the active paths (assuming symmetric paths).
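
(For reference, below is a minimal userspace model of that behaviour: per-path
completion latency is smoothed with an EWMA and converted into a relative
forwarding weight, so a slower path receives proportionally less I/O. The
names, the EWMA factor, and the normalization are illustrative assumptions
and are not taken from the RFC code.)

/*
 * Standalone sketch, not the kernel patch: lower smoothed latency
 * translates into a larger share of the forwarded I/O.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_PATHS	2
#define EWMA_SHIFT	3	/* new = old - old/8 + sample/8 */

struct path_stat {
	uint64_t ewma_lat_ns;	/* smoothed completion latency */
	uint64_t weight;	/* derived forwarding weight (percent) */
};

/* Fold one completion latency sample into the running average. */
static void update_latency(struct path_stat *p, uint64_t sample_ns)
{
	if (!p->ewma_lat_ns)
		p->ewma_lat_ns = sample_ns;
	else
		p->ewma_lat_ns += (sample_ns >> EWMA_SHIFT) -
				  (p->ewma_lat_ns >> EWMA_SHIFT);
}

/* Weight is the inverse of latency, normalized so all weights sum to ~100. */
static void recompute_weights(struct path_stat *paths, int nr)
{
	uint64_t inv_sum = 0;
	int i;

	for (i = 0; i < nr; i++) {
		uint64_t lat = paths[i].ewma_lat_ns ? paths[i].ewma_lat_ns : 1;

		paths[i].weight = 1000000000ULL / lat;
		inv_sum += paths[i].weight;
	}
	for (i = 0; i < nr; i++)
		paths[i].weight = 100 * paths[i].weight / inv_sum;
}

int main(void)
{
	struct path_stat paths[NR_PATHS] = { { 0, 0 }, { 0, 0 } };
	int i;

	/* Path 0 completes in ~100us, path 1 is congested at ~300us. */
	for (i = 0; i < 1000; i++) {
		update_latency(&paths[0], 100000);
		update_latency(&paths[1], 300000);
	}
	recompute_weights(paths, NR_PATHS);

	for (i = 0; i < NR_PATHS; i++)
		printf("path%d: ewma=%lluns weight=%llu%%\n", i,
		       (unsigned long long)paths[i].ewma_lat_ns,
		       (unsigned long long)paths[i].weight);
	return 0;
}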
>>>
>>> Next I'd implement I/O size buckets and also per-NUMA-node weights, and
>>> then rerun the tests and share the results. Let's see if these changes help
>>> further improve the throughput numbers for the adaptive policy. We may then
>>> review the results again and discuss further.
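
(As a rough illustration of the "I/O size buckets" idea above, the sketch
below keeps a separate slot per size class so a 4k read is not averaged
together with a 512k write. The bucket boundaries are assumptions made for
the example, not the proposed implementation.)

/* Standalone sketch: map an I/O size to a latency bucket index. */
#include <stdio.h>

enum io_bucket {
	BUCKET_LE_4K,		/* <= 4k   */
	BUCKET_LE_32K,		/* <= 32k  */
	BUCKET_LE_128K,		/* <= 128k */
	BUCKET_GT_128K,		/* >  128k */
	NR_BUCKETS,
};

static enum io_bucket io_size_to_bucket(unsigned int bytes)
{
	if (bytes <= 4096)
		return BUCKET_LE_4K;
	if (bytes <= 32768)
		return BUCKET_LE_32K;
	if (bytes <= 131072)
		return BUCKET_LE_128K;
	return BUCKET_GT_128K;
}

int main(void)
{
	unsigned int sizes[] = { 512, 4096, 65536, 262144 };
	unsigned int i;

	/* Per path, one smoothed latency would then be kept per bucket. */
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%6u bytes -> bucket %d\n", sizes[i],
		       io_size_to_bucket(sizes[i]));
	return 0;
}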
>>>
>>> Thanks,
>>> --Nilay
>> Two comments:
>> 1. I'd make the read split slightly biased towards small block sizes, and the write split biased towards larger block sizes.
>> 2. I'd also suggest measuring with the weight calculation averaged over all of a NUMA node's cores and then set per-CPU (such that
>> the datapath does not introduce serialization).
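
(A standalone model of point 2, with a made-up 2-node/8-CPU topology: the
weight is derived once per NUMA node from that node's per-CPU latency
samples and then published into every CPU's private slot, so the submission
path only reads its own CPU's value and no cross-CPU serialization is added
on the hot path. All names and numbers are illustrative.)

/* Standalone sketch: per-NUMA averaging, per-CPU publication. */
#include <stdio.h>

#define NR_CPUS		8
#define NR_NODES	2
#define CPUS_PER_NODE	(NR_CPUS / NR_NODES)

static unsigned long percpu_lat_us[NR_CPUS];	/* filled on I/O completion */
static unsigned long percpu_weight[NR_CPUS];	/* read locally on submission */

/* Periodic work: average per-CPU samples per node, fan the result back out. */
static void publish_node_weights(void)
{
	int node;

	for (node = 0; node < NR_NODES; node++) {
		unsigned long sum = 0, avg, weight;
		int cpu;

		for (cpu = node * CPUS_PER_NODE;
		     cpu < (node + 1) * CPUS_PER_NODE; cpu++)
			sum += percpu_lat_us[cpu];

		avg = sum / CPUS_PER_NODE;
		weight = avg ? 1000000UL / avg : 0;	/* toy inverse-latency weight */

		for (cpu = node * CPUS_PER_NODE;
		     cpu < (node + 1) * CPUS_PER_NODE; cpu++)
			percpu_weight[cpu] = weight;
	}
}

int main(void)
{
	int cpu;

	/* Node 0 CPUs observe ~100us latency, node 1 CPUs ~250us. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		percpu_lat_us[cpu] = cpu < CPUS_PER_NODE ? 100 : 250;

	publish_node_weights();

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d weight=%lu\n", cpu, percpu_weight[cpu]);
	return 0;
}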
> Thanks for the suggestions. I ran experiments incorporating both points
> (biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
> weight calculation) using the following setup.
>
> Job file:
> =========
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n1
> rw=<randread/randwrite/randrw>
> bssplit=<based-on-I/O-pattern-type>[1]
> iodepth=32
> numjobs=32
> direct=1
> ==========
>
> [1] Block-size distributions:
>      randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>      randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>      randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>
> Results:
> =======
>
> i) Symmetric paths + system load
>     (CPU stress using cpuload):
>
>          per-CPU      per-CPU-IO-buckets   per-NUMA     per-NUMA-IO-buckets
>          (MiB/s)      (MiB/s)              (MiB/s)      (MiB/s)
>          -------      ------------------   --------     -------------------
> READ:    636          621                  613          618
> WRITE:   1832         1847                 1840         1852
> RW:      R:872        R:869                R:866        R:874
>          W:872        W:870                W:867        W:876
>
> ii) Asymmetric paths + system load
>     (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>
>          per-CPU      per-CPU-IO-buckets   per-NUMA     per-NUMA-IO-buckets
>          (MiB/s)      (MiB/s)              (MiB/s)      (MiB/s)
>          -------      ------------------   --------     -------------------
> READ:    553          543                  540          533
> WRITE:   1705         1670                 1710         1655
> RW:      R:769        R:771                R:784        R:772
>          W:768        W:767                W:785        W:771
>
>
> Looking at the above results,
> - Per-CPU vs per-CPU with I/O buckets:
>    The per-CPU implementation already averages latency effectively across CPUs.
>    Introducing per-CPU I/O buckets does not provide a meaningful throughput
>    improvement; the results remain largely comparable.
>
> - Per-CPU vs per-NUMA aggregation:
>    Calculating or averaging weights at the NUMA level does not significantly
>    improve throughput over per-CPU weight calculation. Across both symmetric
>    and asymmetric scenarios, the results remain very close.
>
> So, based on the above results and assessment, unless there are additional
> scenarios or metrics of interest, shall we proceed with per-CPU weight
> calculation for this new I/O policy?

I think it is counterintuitive that bucketing I/O sizes does not
present any advantage. Don't you?
Maybe the test is not a good enough representation...

Let's also test what happens with multiple clients against the same
subsystem.


