[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Sagi Grimberg
sagi at grimberg.me
Sun Jan 4 13:06:27 PST 2026
On 04/01/2026 11:07, Nilay Shroff wrote:
>
> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>> file I used for the test, followed by the observed throughput result for reference.
>>>
>>> Job file:
>>> =========
>>>
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> cpumode=qsort
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n2
>>> rw=<randread/randwrite/randrw>
>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>>
>>> Throughput:
>>> ===========
>>>
>>>          numa            round-robin     queue-depth     adaptive
>>>          ------------    ------------    ------------    ------------
>>> READ:    1120 MiB/s      2241 MiB/s      2233 MiB/s      2215 MiB/s
>>> WRITE:   1107 MiB/s      1875 MiB/s      1847 MiB/s      1892 MiB/s
>>> RW:      R:1001 MiB/s    R:1047 MiB/s    R:1086 MiB/s    R:1112 MiB/s
>>>          W:999 MiB/s     W:1045 MiB/s    W:1084 MiB/s    W:1111 MiB/s
>>>
>>> When comparing the results, I did not observe a significant throughput
>>> difference between the queue-depth, round-robin, and adaptive policies.
>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>> out the varying latency values and distribute I/O reasonably evenly
>>> across the active paths (assuming symmetric paths).
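(Roughly, what is described above amounts to the toy model below: fold each
completion latency into a per-path moving average and weight each path by the
inverse of that average, so I/O is spread in proportion to how fast a path
currently looks. This is only a userspace sketch; the names and constants are
made up and are not taken from the patch set.)

/* Toy userspace model of latency-averaged path weights; names and
 * constants are illustrative only, not taken from the patch set. */
#include <stdint.h>
#include <stdio.h>

#define NR_PATHS 2

struct model_path {
	const char *name;
	uint64_t avg_lat_ns;	/* moving average of completion latency */
	uint64_t weight;	/* relative share of I/O for this path */
};

/* Fold one completion latency sample into the running average (1/8 gain). */
static void update_latency(struct model_path *p, uint64_t lat_ns)
{
	if (!p->avg_lat_ns)
		p->avg_lat_ns = lat_ns;
	else
		p->avg_lat_ns = (p->avg_lat_ns * 7 + lat_ns) / 8;
}

/* Weight is the inverse of the averaged latency; the policy then spreads
 * I/O across paths in proportion to these weights, so symmetric paths end
 * up with an even split and a congested path receives less. */
static void recompute_weights(struct model_path *paths, int nr)
{
	for (int i = 0; i < nr; i++)
		paths[i].weight = paths[i].avg_lat_ns ?
				  1000000000ULL / paths[i].avg_lat_ns : 1;
}

int main(void)
{
	struct model_path paths[NR_PATHS] = {
		{ .name = "pathA" }, { .name = "pathB" },
	};

	/* Synthetic completions: pathA is a bit slower than pathB. */
	update_latency(&paths[0], 800000);
	update_latency(&paths[1], 500000);
	update_latency(&paths[0], 900000);
	update_latency(&paths[1], 450000);

	recompute_weights(paths, NR_PATHS);
	for (int i = 0; i < NR_PATHS; i++)
		printf("%s: avg %llu ns, weight %llu\n", paths[i].name,
		       (unsigned long long)paths[i].avg_lat_ns,
		       (unsigned long long)paths[i].weight);
	return 0;
}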
>>>
>>> Next, I'll implement I/O size buckets as well as per-NUMA-node weights,
>>> then rerun the tests and share the results. Let's see whether these
>>> changes further improve the throughput numbers for the adaptive policy.
>>> We can then review the results again and discuss further.
>>>
>>> Thanks,
>>> --Nilay
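(For the I/O size buckets mentioned above, one plausible shape is sketched
below: keep a separate latency average per path and per size class, so that
large writes do not distort the average seen by small reads. Again a
userspace sketch with invented bucket boundaries and names, not code from
the patch set.)

/* Userspace sketch of per-size-class latency buckets; bucket boundaries
 * and names are invented for illustration, not taken from the patch set. */
#include <stdint.h>
#include <stdio.h>

enum io_bucket { BUCKET_SMALL, BUCKET_MEDIUM, BUCKET_LARGE, NR_BUCKETS };

/* Map an I/O size to a bucket: <= 16k small, <= 128k medium, else large. */
static enum io_bucket bucket_of(uint32_t bytes)
{
	if (bytes <= 16 * 1024)
		return BUCKET_SMALL;
	if (bytes <= 128 * 1024)
		return BUCKET_MEDIUM;
	return BUCKET_LARGE;
}

struct model_path {
	const char *name;
	uint64_t avg_lat_ns[NR_BUCKETS];	/* one average per size class */
};

/* Only the matching bucket's average is updated on completion, so a burst
 * of 512k writes does not distort the latency seen by 4k reads. */
static void update_latency(struct model_path *p, uint32_t bytes, uint64_t lat_ns)
{
	enum io_bucket b = bucket_of(bytes);

	if (!p->avg_lat_ns[b])
		p->avg_lat_ns[b] = lat_ns;
	else
		p->avg_lat_ns[b] = (p->avg_lat_ns[b] * 7 + lat_ns) / 8;
}

int main(void)
{
	struct model_path p = { .name = "pathA" };

	update_latency(&p, 4096, 200000);	/* small read completion */
	update_latency(&p, 524288, 3000000);	/* large write completion */
	printf("%s: small %llu ns, large %llu ns\n", p.name,
	       (unsigned long long)p.avg_lat_ns[BUCKET_SMALL],
	       (unsigned long long)p.avg_lat_ns[BUCKET_LARGE]);
	return 0;
}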
>> Two comments:
>> 1. I'd bias the read split slightly towards small block sizes, and the write split towards larger block sizes.
>> 2. I'd also suggest measuring with the weight calculation averaged across all cores of a NUMA node and then set per-CPU
>> (so that the datapath does not introduce serialization).
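To illustrate point 2 above: latency samples stay strictly per CPU, a periodic
worker averages them across the CPUs of a NUMA node, and the resulting weight
is written back into every CPU's private slot, so the submission path only
ever touches its local value. The sketch below is a plain userspace model,
with arrays standing in for percpu variables and invented names, not the
patch code.

/* Userspace model only: plain arrays stand in for percpu variables and a
 * function call stands in for periodic work done off the I/O hot path. */
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS		8
#define CPUS_PER_NODE	4	/* CPUs 0-3 on node 0, CPUs 4-7 on node 1 */

/* The completion path writes only its own CPU's slots: no shared
 * cachelines and no locking on the hot path. */
static uint64_t percpu_lat_sum[NR_CPUS];
static uint64_t percpu_lat_cnt[NR_CPUS];

/* The submission path reads only its own CPU's slot when weighing paths. */
static uint64_t percpu_weight[NR_CPUS];

/* Completion path: account the latency sample on the local CPU. */
static void account_latency(int cpu, uint64_t lat_ns)
{
	percpu_lat_sum[cpu] += lat_ns;
	percpu_lat_cnt[cpu]++;
}

/* Periodic worker: average the samples across all CPUs of a node and
 * publish the resulting weight back into each CPU's private slot. */
static void refresh_node_weight(int node)
{
	int first = node * CPUS_PER_NODE;
	uint64_t sum = 0, cnt = 0;

	for (int cpu = first; cpu < first + CPUS_PER_NODE; cpu++) {
		sum += percpu_lat_sum[cpu];
		cnt += percpu_lat_cnt[cpu];
	}
	if (!cnt)
		return;

	uint64_t avg = sum / cnt;
	if (!avg)
		return;

	uint64_t weight = 1000000000ULL / avg;	/* inverse of the node-wide average */

	for (int cpu = first; cpu < first + CPUS_PER_NODE; cpu++)
		percpu_weight[cpu] = weight;
}

int main(void)
{
	/* A few synthetic completions on different CPUs of node 0. */
	account_latency(0, 600000);
	account_latency(1, 800000);
	account_latency(2, 700000);

	refresh_node_weight(0);
	printf("cpu0 sees weight %llu\n", (unsigned long long)percpu_weight[0]);
	return 0;
}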
> Thanks for the suggestions. I ran experiments incorporating both points—
> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
> weight calculation—using the following setup.
>
> Job file:
> =========
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n1
> rw=<randread/randwrite/randrw>
> bssplit=<based-on-I/O-pattern-type>[1]
> iodepth=32
> numjobs=32
> direct=1
> ==========
>
> [1] Block-size distributions:
> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>
> Results:
> =======
>
> i) Symmetric paths + system load
> (CPU stress using cpuload):
>
>          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>          -------   ------------------   --------   -------------------
> READ:    636       621                  613        618
> WRITE:   1832      1847                 1840       1852
> RW:      R:872     R:869                R:866      R:874
>          W:872     W:870                W:867      W:876
>
> ii) Asymmetric paths + system load
> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>
>          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>          -------   ------------------   --------   -------------------
> READ:    553       543                  540        533
> WRITE:   1705      1670                 1710       1655
> RW:      R:769     R:771                R:784      R:772
>          W:768     W:767                W:785      W:771
>
>
> Looking at the above results,
> - Per-CPU vs per-CPU with I/O buckets:
> The per-CPU implementation already averages latency effectively across CPUs.
> Introducing per-CPU I/O buckets does not provide a meaningful throughput
> improvement; the results remain largely comparable.
>
> - Per-CPU vs per-NUMA aggregation:
> Calculating or averaging weights at the NUMA level does not significantly
> improve throughput over per-CPU weight calculation. Across both symmetric
> and asymmetric scenarios, the results remain very close.
>
> Based on the above results and assessment, unless there are additional
> scenarios or metrics of interest, shall we proceed with per-CPU weight
> calculation for this new I/O policy?
I think it is counterintuitive that bucketing I/O sizes does not
show any advantage. Don't you?
Maybe the test is not a representative enough workload...
Let's also test what happens with multiple clients against the same
subsystem.