[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Mon Feb 2 05:33:49 PST 2026
On 1/6/26 7:46 PM, Nilay Shroff wrote:
>
>
> On 1/5/26 2:36 AM, Sagi Grimberg wrote:
>>
>>
>> On 04/01/2026 11:07, Nilay Shroff wrote:
>>>
>>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>>
>>>>> Job file:
>>>>> =========
>>>>>
>>>>> [global]
>>>>> time_based
>>>>> runtime=120
>>>>> group_reporting=1
>>>>>
>>>>> [cpu]
>>>>> ioengine=cpuio
>>>>> cpuload=85
>>>>> cpumode=qsort
>>>>> numjobs=32
>>>>>
>>>>> [disk]
>>>>> ioengine=io_uring
>>>>> filename=/dev/nvme1n2
>>>>> rw=<randread/randwrite/randrw>
>>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>>> iodepth=32
>>>>> numjobs=32
>>>>> direct=1
>>>>>
>>>>> Throughput:
>>>>> ===========
>>>>>
>>>>>        numa           round-robin    queue-depth    adaptive
>>>>>        -----------    -----------    -----------    ---------
>>>>> READ:  1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>>>>> WRITE: 1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>>>>> RW:    R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>>>>        W:999 MiB/s    W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>>>>
>>>>> When comparing the results, I did not observe a significant throughput
>>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>>> out the varying latency values and distribute I/O reasonably evenly
>>>>> across the active paths (assuming symmetric paths).
>>>>>
>>>>> Next I'd implement I/O size buckets and also per-NUMA-node weights, and
>>>>> then rerun the tests and share the results. Let's see if these changes
>>>>> help further improve the throughput numbers for the adaptive policy. We
>>>>> may then review the results again and discuss further.
>>>>>
>>>>> Thanks,
>>>>> --Nilay
>>>> two comments:
>>>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>>>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>>>> the datapath does not introduce serialization).
>>> Thanks for the suggestions. I ran experiments incorporating both points—
>>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>>> weight calculation—using the following setup.
>>>
>>> Job file:
>>> =========
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n1
>>> rw=<randread/randwrite/randrw>
>>> bssplit=<based-on-I/O-pattern-type>[1]
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>> ==========
>>>
>>> [1] Block-size distributions:
>>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>>
>>> Results:
>>> =======
>>>
>>> i) Symmetric paths + system load
>>> (CPU stress using cpuload):
>>>
>>>          per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
>>>          (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
>>>          -------    ------------------    --------    -------------------
>>> READ:    636        621                   613         618
>>> WRITE:   1832       1847                  1840        1852
>>> RW:      R:872      R:869                 R:866       R:874
>>>          W:872      W:870                 W:867       W:876
>>>
>>> ii) Asymmetric paths + system load
>>> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>>
>>>          per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
>>>          (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
>>>          -------    ------------------    --------    -------------------
>>> READ:    553        543                   540         533
>>> WRITE:   1705       1670                  1710        1655
>>> RW:      R:769      R:771                 R:784       R:772
>>>          W:768      W:767                 W:785       W:771
>>>
>>>
>>> Looking at the above results,
>>> - Per-CPU vs per-CPU with I/O buckets:
>>> The per-CPU implementation already averages latency effectively across CPUs.
>>> Introducing per-CPU I/O buckets does not provide a meaningful throughput
>>> improvement and remains largely comparable.
>>>
>>> - Per-CPU vs per-NUMA aggregation:
>>> Calculating or averaging weights at the NUMA level does not significantly
>>> improve throughput over per-CPU weight calculation. Across both symmetric
>>> and asymmetric scenarios, the results remain very close.
>>>
>>> So now, based on the above results and assessment, unless there are
>>> additional scenarios or metrics of interest, shall we proceed with per-CPU
>>> weight calculation for this new I/O policy?
>>
>> I think it is counter intuitive that bucketing I/O sizes does not present any advantage. Don't you?
>> Maybe the test is not good enough of a representation...
>>
> Hmm, you were correct; I also thought the same, but I couldn't find any
> test which could prove the advantage of using I/O buckets. So today I
> spent some time thinking about scenarios which could demonstrate the worth
> of I/O buckets, and I came up with the following use case.
>
> Size-dependent path behavior:
>
> 1. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
> Now running mixed I/O (bssplit => 16k/75:64k/25),
>
> Without buckets:
> Path B looks good; scheduler forwards more I/Os towards path B.
>
> With buckets:
> small I/Os are distributed across path A and B
> large I/Os favor path B
>
> So in theory, throughput should improve with buckets.
>
> 2. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: opposite
>
> Without buckets:
> latency averages cancel out
> scheduler sees “paths are equal”
>
> With buckets:
> small I/O bucket favors A
> large I/O bucket favors B
>
> Again, in theory, throughput should improve with buckets.
>
> So with the above thought, I ran another experiment; the results are
> shown below:
>
> I injected additional delay on one path for larger I/Os (>=32k) and
> mixed I/O sizes with bssplit => 16k/75:64k/25. So with this test, we
> have:
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
>          per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
>          (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
>          -------    ------------------    --------    -------------------
> READ:    550        622                   523         615
> WRITE:   726        829                   747         834
> RW:      R:324      R:381                 R:306       R:375
>          W:323      W:381                 W:306       W:374
>
> So yes, I/O buckets could be useful for the scenario tested above. And
> regarding per-CPU vs per-NUMA weight calculation, do you agree that
> per-CPU should be good enough for this policy, given that, as seen above,
> per-NUMA doesn't improve performance much?
>
>
>> Let's also test what happens with multiple clients against the same subsystem.
> Yes, this is a good test to run; I will test and post the results.
>
Finally, I was able to run tests with two nvmf-tcp hosts connected
to the same nvmf-tcp target. Apologies for the delay; setting up this
topology took some time, partly due to recent non-technical infrastructure
challenges after our lab relocation.

The goal of these tests was to evaluate per-CPU vs per-NUMA weight
calculation, with and without I/O size buckets, under multi-client
contention. As in my earlier tests, I ran randread, randwrite and randrw
workloads with mixed I/O sizes (using bssplit) and induced CPU stress on
the hosts (using cpuload). Please find the test results and observations
below.

Workload characteristics:
=========================
- Workloads tested: randread, randwrite, randrw
- Mixed I/O sizes using bssplit
- CPU stress induced using cpuload
- Both hosts run workloads simultaneously

Job file:
=========
[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
ramp_time=120

[1] Block-size distributions:
randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5

Test topology:
==============
1. Two nvmf-tcp hosts connected to the same nvmf-tcp target
2. Each host connects to the target over two symmetric paths
3. System load on each host is induced using cpuload (as shown in the job file)
4. Both hosts run I/O workloads concurrently

Results:
========

Host1:
         per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
         (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
         -------    ------------------    --------    -------------------
READ:    153        164                   166         131
WRITE:   839        837                   889         839
RW:      R:249      R:255                 R:226       R:256
         W:247      W:254                 W:225       W:253

Host2:
         per-CPU    per-CPU-IO-buckets    per-NUMA    per-NUMA-IO-buckets
         (MiB/s)    (MiB/s)               (MiB/s)     (MiB/s)
         -------    ------------------    --------    -------------------
READ:    268        258                   279         268
WRITE:   1012       992                   880         1017
RW:      R:386      R:410                 R:401       R:405
         W:385      W:409                 W:399       W:405

From the above results, I get the same impression as from my earlier
single-host tests against the same target.

Per-CPU vs per-CPU with I/O buckets:
- The per-CPU implementation already averages latency effectively across CPUs.
- Introducing per-CPU I/O buckets does not provide a meaningful throughput
  improvement in the general case.
- Results remain largely comparable across workloads and hosts.
- However, as shown in the earlier experiment with I/O size-dependent path
  behavior, I/O buckets can provide measurable benefits in specific
  scenarios (see the rough sketch below).
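
To make that concrete, below is a small standalone C sketch of the kind of
selection logic I have in mind (hypothetical names and bucket boundaries,
not the actual patch code): each path keeps one smoothed latency score per
I/O size bucket, and the scheduler consults only the score of the bucket
the incoming I/O falls into, so a path that is slow only for large I/Os
still gets its fair share of small I/Os. The real policy would distribute
I/O in proportion to per-path weights rather than always picking the
minimum, but the per-bucket lookup is the point here.

/* Illustrative userspace sketch only -- hypothetical names, not the patch code. */
#include <stdio.h>

#define NR_BUCKETS	3	/* e.g. <=16k, <=64k, >64k */

struct path_score {
	const char *name;
	/* smoothed completion latency (us), one slot per I/O size bucket */
	unsigned long lat_ewma[NR_BUCKETS];
};

static int io_size_to_bucket(unsigned int len)
{
	if (len <= 16 * 1024)
		return 0;
	if (len <= 64 * 1024)
		return 1;
	return 2;
}

/* With buckets: compare only the score of the bucket this I/O falls into. */
static struct path_score *select_path(struct path_score *paths, int nr,
				      unsigned int len)
{
	int i, b = io_size_to_bucket(len);
	struct path_score *best = &paths[0];

	for (i = 1; i < nr; i++)
		if (paths[i].lat_ewma[b] < best->lat_ewma[b])
			best = &paths[i];
	return best;
}

int main(void)
{
	/* Path A: good for <=16k, bad for >=32k; Path B: good for all */
	struct path_score paths[] = {
		{ "A", { 100, 900, 1500 } },
		{ "B", { 120, 180,  250 } },
	};

	printf("16k I/O -> path %s\n", select_path(paths, 2, 16 * 1024)->name);
	printf("64k I/O -> path %s\n", select_path(paths, 2, 64 * 1024)->name);
	return 0;
}

Without buckets, path A's good small-I/O latency and bad large-I/O latency
would blend into a single score and the scheduler would largely avoid A,
which is exactly the effect seen in the size-dependent path experiment above.
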
Per-CPU vs per-NUMA aggregation:
- Calculating or averaging weights at the NUMA level does not significantly improve
throughput over per-CPU weight calculation.
- This holds true even under multi-host contention.

Based on all the tests conducted so far, covering symmetric and asymmetric
paths, CPU stress, size-dependent path behavior, and multi-client access to
the same target, the results suggest that we should move forward with a
per-CPU implementation using I/O buckets. That said, I am open to any further
feedback, suggestions, or additional scenarios that might be worth evaluating.
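
For the per-CPU side, the direction I am thinking of is roughly the
following (again a simplified, standalone sketch with hypothetical names,
not the actual patch code): each path keeps per-CPU, per-bucket latency
state, which in the driver would be allocated with alloc_percpu() so that
the completion path only touches CPU-local data and introduces no
cross-CPU serialization; the per-bucket latency is smoothed with a simple
EWMA and the forwarding weight is then derived from it.

/* Simplified standalone sketch -- hypothetical names, not the actual patch
 * code. In the driver this state would be allocated per path with
 * alloc_percpu(), keeping completion-path updates CPU-local.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_BUCKETS	3	/* e.g. <=16k, <=64k, >64k */

struct adaptive_stat {
	uint64_t lat_ewma[NR_BUCKETS];	/* smoothed completion latency (us) */
};

/* EWMA with weight 1/8: new = old - old/8 + sample/8 */
static void adaptive_update(struct adaptive_stat *st, int bucket, uint64_t lat_us)
{
	uint64_t old = st->lat_ewma[bucket];

	st->lat_ewma[bucket] = old - (old >> 3) + (lat_us >> 3);
}

int main(void)
{
	struct adaptive_stat st = { .lat_ewma = { 100, 400, 1200 } };
	int i;

	/* a burst of slow 64k completions ages only the middle bucket */
	for (i = 0; i < 16; i++)
		adaptive_update(&st, 1, 2000);

	printf("<=16k: %llu us, <=64k: %llu us, >64k: %llu us\n",
	       (unsigned long long)st.lat_ewma[0],
	       (unsigned long long)st.lat_ewma[1],
	       (unsigned long long)st.lat_ewma[2]);
	return 0;
}

If it turns out to be useful later, the per-CPU values could still be
periodically averaged per NUMA node and written back to the per-CPU copies,
along the lines Sagi suggested, but the numbers above suggest that is not
needed for now.
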
Thanks,
--Nilay