[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Tue Jan 6 06:16:29 PST 2026
On 1/5/26 2:36 AM, Sagi Grimberg wrote:
>
>
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>>          numa           round-robin    queue-depth    adaptive
>>>>          -----------    -----------    -----------    ---------
>>>> READ:    1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>>>> WRITE:   1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>>>> RW:      R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>>>          W:999 MiB/s    W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
>>>>
>>>> Next, I'll implement I/O size buckets and per-NUMA-node weights, then
>>>> rerun the tests and share the results. Let's see if these changes help
>>>> further improve the throughput numbers for the adaptive policy. We can
>>>> then review the results again and discuss further.
>>>>
>>>> Thanks,
>>>> --Nilay
>>> two comments:
>>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>>> the datapath does not introduce serialization).
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> =======
>>
>> i) Symmetric paths + system load
>> (CPU stress using cpuload):
>>
>>          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>          -------   ------------------   --------   -------------------
>> READ:    636       621                  613        618
>> WRITE:   1832      1847                 1840       1852
>> RW:      R:872     R:869                R:866      R:874
>>          W:872     W:870                W:867      W:876
>>
>> ii) Asymmetric paths + system load
>> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>
>>          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>          -------   ------------------   --------   -------------------
>> READ:    553       543                  540        533
>> WRITE:   1705      1670                 1710       1655
>> RW:      R:769     R:771                R:784      R:772
>>          W:768     W:767                W:785      W:771
>>
>>
>> Looking at the above results,
>> - Per-CPU vs per-CPU with I/O buckets:
>> The per-CPU implementation already averages latency effectively across CPUs.
>> Introducing per-CPU I/O buckets does not provide a meaningful throughput
>> improvement and remains largely comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>> Calculating or averaging weights at the NUMA level does not significantly
>> improve throughput over per-CPU weight calculation. Across both symmetric
>> and asymmetric scenarios, the results remain very close.
>>
>> So now based on above results and assessment, unless there are additional
>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>> calculation for this new I/O policy?
>
> I think it is counter intuitive that bucketing I/O sizes does not present any advantage. Don't you?
> Maybe the test is not good enough of a representation...
>
Hmm, you were right. I also thought the same, but I couldn't find a test
that demonstrated the advantage of using I/O buckets. So today I spent
some time thinking about scenarios where I/O buckets should prove their
worth, and came up with the following use cases.
Size-dependent path behavior:

1. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: good for all sizes

   Now run mixed I/O (bssplit => 16k/75:64k/25).

   Without buckets:
      Path B looks better overall; the scheduler forwards more I/Os
      towards path B.
   With buckets:
      small I/Os are distributed across paths A and B
      large I/Os favor path B

   So in theory, throughput should improve with buckets.

2. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: the opposite

   Without buckets:
      the latency averages cancel out, so the scheduler sees
      "the paths are equal"
   With buckets:
      the small I/O bucket favors A
      the large I/O bucket favors B

   Again, in theory, throughput should improve with buckets.
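
To make the bucket idea concrete, here is a rough sketch of what
per-path, per-I/O-size-bucket latency tracking could look like. This is
illustrative only, not the code from this patch series; the structure,
helper names, and bucket boundaries below are made up for the example:

#include <linux/types.h>
#include <linux/sizes.h>

#define ADAPTIVE_NR_BUCKETS	3	/* e.g. <=16k, <=128k, >128k */

struct adaptive_path_stats {
	/* EWMA of completion latency (ns), one slot per size bucket */
	u64 ewma_lat_ns[ADAPTIVE_NR_BUCKETS];
};

static inline int adaptive_size_to_bucket(unsigned int len)
{
	if (len <= SZ_16K)
		return 0;
	if (len <= SZ_128K)
		return 1;
	return 2;
}

/* On completion: fold the observed latency into this bucket's EWMA. */
static void adaptive_update_latency(struct adaptive_path_stats *st,
				    unsigned int len, u64 lat_ns)
{
	int b = adaptive_size_to_bucket(len);
	s64 delta = (s64)lat_ns - (s64)st->ewma_lat_ns[b];

	/* 1/8 weight for the new sample: avg += (new - avg) / 8 */
	st->ewma_lat_ns[b] += delta / 8;
}

The path selector would then derive a weight per bucket and, for each
incoming bio, compare paths using the bucket that matches the bio's
size instead of a single averaged latency across all sizes.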
With the above in mind, I ran another experiment; the results are shown
below. I injected additional delay on one path for larger I/Os (>=32k)
and mixed the I/O sizes with bssplit => 16k/75:64k/25, so for this test
we have:

Path A: good for ≤16k, bad for ≥32k
Path B: good for all sizes
          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
          -------   ------------------   --------   -------------------
READ:     550       622                  523        615
WRITE:    726       829                  747        834
RW:       R:324     R:381                R:306      R:375
          W:323     W:381                W:306      W:374
So yes, I/O buckets can be useful for the scenario tested above. And
regarding per-CPU vs per-NUMA weight calculation: do you agree that
per-CPU should be good enough for this policy, given that, as seen
above, per-NUMA aggregation does not improve performance much?
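
For reference, a minimal sketch of the per-CPU weight storage I have in
mind is below (hypothetical names, not necessarily how the final patch
will look); the point is that the submission path only touches the local
CPU's state, so no locks or shared cachelines are involved:

#include <linux/percpu.h>
#include <linux/types.h>
#include <linux/errno.h>

struct adaptive_cpu_weight {
	u32 weight;	/* derived from the EWMA latency seen on this CPU */
};

struct adaptive_path {
	struct adaptive_cpu_weight __percpu *cpu_weight;
};

static int adaptive_path_init(struct adaptive_path *p)
{
	p->cpu_weight = alloc_percpu(struct adaptive_cpu_weight);
	return p->cpu_weight ? 0 : -ENOMEM;
}

/* Submission path: read only the local CPU's weight, lock-free. */
static u32 adaptive_path_weight(struct adaptive_path *p)
{
	struct adaptive_cpu_weight *w;
	u32 weight;

	w = get_cpu_ptr(p->cpu_weight);
	weight = w->weight;
	put_cpu_ptr(p->cpu_weight);

	return weight;
}

The weights themselves would be recomputed from completion-side latency
samples and written per CPU (for example from a periodic worker), so
readers on the submission path never contend with each other.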
> Lets also test what happens with multiple clients against the same subsystem.
Yes, this is a good test to run; I will run it and post the results.
Thanks,
--Nilay