[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Sun Jan 4 01:07:48 PST 2026
On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>
>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>> file I used for the test, followed by the observed throughput result for reference.
>>
>> Job file:
>> =========
>>
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> cpumode=qsort
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n2
>> rw=<randread/randwrite/randrw>
>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>> iodepth=32
>> numjobs=32
>> direct=1
>>
>> Throughput:
>> ===========
>>
>>            numa            round-robin     queue-depth     adaptive
>>            -----------     -----------     -----------     ---------
>> READ:      1120 MiB/s      2241 MiB/s      2233 MiB/s      2215 MiB/s
>> WRITE:     1107 MiB/s      1875 MiB/s      1847 MiB/s      1892 MiB/s
>> RW:        R:1001 MiB/s    R:1047 MiB/s    R:1086 MiB/s    R:1112 MiB/s
>>            W:999 MiB/s     W:1045 MiB/s    W:1084 MiB/s    W:1111 MiB/s
>>
>> When comparing the results, I did not observe a significant throughput
>> difference between the queue-depth, round-robin, and adaptive policies.
>> With random I/O of mixed sizes, the adaptive policy appears to average
>> out the varying latency values and distribute I/O reasonably evenly
>> across the active paths (assuming symmetric paths).
>>
>> Next I'll implement I/O-size buckets as well as per-NUMA-node weights,
>> then rerun the tests and share the results. Let's see if these changes
>> further improve the throughput numbers for the adaptive policy. We can
>> then review the results again and discuss further.
>>
>> Thanks,
>> --Nilay
>
> two comments:
> 1. I'd make the read split slightly biased towards small block sizes, and the write split biased towards larger block sizes.
> 2. I'd also suggest measuring with the weight calculation averaged across all NUMA-node cores and then set per-CPU (such that
> the datapath does not introduce serialization).
Thanks for the suggestions. I ran experiments incorporating both points:
biasing I/O sizes by operation type, and comparing per-CPU versus per-NUMA
weight calculation. I used the following setup.
Job file:
=========

[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
==========
[1] Block-size distributions:
randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
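For reference, the percentage-weighted mean block size implied by each split
can be computed with a small standalone sketch; mean_block_size below is just
an illustrative helper, not part of fio or the patch:

```python
def mean_block_size(bssplit: str) -> float:
    """Return the percentage-weighted average block size in bytes
    for an fio-style bssplit spec like '4k/10:64k/20:...'."""
    units = {"k": 1024, "m": 1024 * 1024}
    total = 0.0
    for entry in bssplit.split(":"):
        size, pct = entry.split("/")
        suffix = size[-1].lower()
        mult = units.get(suffix, 1)
        num = size[:-1] if suffix in units else size
        total += int(num) * mult * int(pct) / 100.0
    return total

splits = {
    "randread": "512/30:4k/25:8k/20:16k/15:32k/10",
    "randwrite": "4k/10:64k/20:128k/30:256k/40",
    "randrw": "512/20:4k/25:32k/25:64k/20:128k/5:256k/5",
}
for rw, spec in splits.items():
    # e.g. the read split averages out well below the write split,
    # reflecting the small-read / large-write bias suggested above.
    print(f"{rw}: ~{mean_block_size(spec) / 1024:.1f} KiB average")
```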
Results:
=======
i) Symmetric paths + system load
   (CPU stress using cpuload):

          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)        (MiB/s)          (MiB/s)        (MiB/s)
          -------   ------------------   --------   -------------------
READ:       636            621              613             618
WRITE:     1832           1847             1840            1852
RW:       R:872          R:869            R:866           R:874
          W:872          W:870            W:867           W:876
ii) Asymmetric paths + system load
    (CPU stress using cpuload and iperf3 traffic to induce network congestion):

          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)        (MiB/s)          (MiB/s)        (MiB/s)
          -------   ------------------   --------   -------------------
READ:       553            543              540             533
WRITE:     1705           1670             1710            1655
RW:       R:769          R:771            R:784           R:772
          W:768          W:767            W:785           W:771
Looking at the above results:

- Per-CPU vs. per-CPU with I/O buckets:
  The per-CPU implementation already averages latency effectively across
  CPUs. Introducing per-CPU I/O-size buckets does not provide a meaningful
  throughput improvement; the numbers remain largely comparable.

- Per-CPU vs. per-NUMA aggregation:
  Calculating or averaging weights at the NUMA level does not significantly
  improve throughput over per-CPU weight calculation. Across both symmetric
  and asymmetric scenarios, the results remain very close.
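
To make the two aggregation strategies concrete, here is a minimal
userspace model of what is being compared; this is an illustrative sketch,
not the patch code, and all names (ewma_update, path_weights, the 1/8
smoothing factor, the NUMA grouping) are assumptions made for the example:

```python
EWMA_SHIFT = 3  # weight each new sample 1/8 (a plausible smoothing factor)

def ewma_update(avg: int, sample: int) -> int:
    """Exponentially weighted moving average: avg += (sample - avg) / 8."""
    return avg + ((sample - avg) >> EWMA_SHIFT)

def path_weights(latencies: dict) -> dict:
    """Turn per-path average latencies into relative weights:
    lower latency => higher weight (inverse-latency normalization)."""
    inv = {p: 1_000_000 // max(lat, 1) for p, lat in latencies.items()}
    total = sum(inv.values())
    return {p: 100 * v // total for p, v in inv.items()}

# Per-CPU: each CPU keeps its own latency averages and derives its own
# weights from them, with no cross-CPU communication on the datapath.
cpu0 = {"pathA": 0, "pathB": 0}
for sample_a, sample_b in [(100, 300), (120, 280), (110, 310)]:
    cpu0["pathA"] = ewma_update(cpu0["pathA"], sample_a)
    cpu0["pathB"] = ewma_update(cpu0["pathB"], sample_b)

# Per-NUMA: average the per-CPU EWMAs across the node's CPUs, then
# publish the resulting weights back into per-CPU copies so the
# datapath still only reads a local value (no serialization).
cpu1 = {"pathA": 40, "pathB": 90}
numa_avg = {p: (cpu0[p] + cpu1[p]) // 2 for p in cpu0}

print("per-CPU weights  :", path_weights(cpu0))
print("per-NUMA weights :", path_weights(numa_avg))
```

With symmetric paths the two variants converge to nearly identical weight
splits, which is consistent with the close throughput numbers above.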
Based on the above results and assessment, unless there are additional
scenarios or metrics of interest, shall we proceed with per-CPU weight
calculation for this new I/O policy?
Thanks,
--Nilay