[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Hannes Reinecke
hare at suse.de
Wed Jan 7 03:15:16 PST 2026
On 1/4/26 22:06, Sagi Grimberg wrote:
>
>
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/
>>>>> `cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode.
>>>> Below is the job file I used for the test, followed by the observed
>>>> throughput results for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>>            numa           round-robin    queue-depth    adaptive
>>>>            -----------    -----------    -----------    -----------
>>>> READ:      1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>>>> WRITE:     1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>>>> RW:        R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>>>            W:999 MiB/s    W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
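>>>>
>>>> For concreteness, the per-path averaging could be wired up roughly as
>>>> in the sketch below (illustrative only; hypothetical struct and helper
>>>> names, not the actual patch code): every completion feeds a CPU-local
>>>> EWMA of the observed latency, and the path weight is derived from it.
>>>>
>>>> #include <linux/percpu.h>
>>>> #include <linux/ktime.h>
>>>> #include <linux/math64.h>
>>>>
>>>> /* Hypothetical per-path, per-CPU state -- not the real patch structs. */
>>>> struct adaptive_stat {
>>>>         u64 ewma_lat_ns;        /* smoothed completion latency */
>>>>         u64 weight;             /* derived path weight         */
>>>> };
>>>>
>>>> /* Called on I/O completion with the measured latency of the request. */
>>>> static void adaptive_update(struct adaptive_stat __percpu *stat, u64 lat_ns)
>>>> {
>>>>         struct adaptive_stat *s = get_cpu_ptr(stat);
>>>>
>>>>         /* EWMA with 1/8 smoothing: new = 7/8 * old + 1/8 * sample */
>>>>         s->ewma_lat_ns = s->ewma_lat_ns ?
>>>>                 (s->ewma_lat_ns * 7 + lat_ns) >> 3 : lat_ns;
>>>>
>>>>         /* Lower latency => higher weight; the scale is arbitrary here. */
>>>>         s->weight = div64_u64(NSEC_PER_SEC, s->ewma_lat_ns + 1);
>>>>         put_cpu_ptr(stat);
>>>> }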
>>>>
>>>> Next I'd implement I/O size buckets and also per-NUMA-node weights,
>>>> then rerun the tests and share the results. Let's see if these changes
>>>> further improve the throughput numbers for the adaptive policy. We can
>>>> then review the results again and discuss further.
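>>>>
>>>> One way such bucketing could look (illustrative sketch only; the
>>>> thresholds and names below are placeholders, not the final code) is to
>>>> keep a separate latency average per size class, so that a 512B read
>>>> and a 256k write are not folded into the same EWMA:
>>>>
>>>> #include <linux/sizes.h>
>>>>
>>>> #define ADAPTIVE_NR_BUCKETS     4
>>>>
>>>> /* Map a request size to one of a few size classes. */
>>>> static unsigned int adaptive_io_bucket(unsigned int io_bytes)
>>>> {
>>>>         if (io_bytes <= SZ_8K)
>>>>                 return 0;       /* small  */
>>>>         if (io_bytes <= SZ_64K)
>>>>                 return 1;       /* medium */
>>>>         if (io_bytes <= SZ_256K)
>>>>                 return 2;       /* large  */
>>>>         return 3;               /* huge   */
>>>> }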
>>>>
>>>> Thanks,
>>>> --Nilay
>>> two comments:
>>> 1. I'd bias the read block-size split slightly towards small block
>>>    sizes, and the write split towards larger block sizes.
>>> 2. I'd also suggest measuring with the weight calculation averaged
>>>    over all cores of a NUMA node and then set per-cpu, so that the
>>>    datapath does not introduce serialization; a rough sketch follows
>>>    below.
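>>>
>>> Something along these lines, roughly (sketch only, invented names):
>>> periodically average one path's per-cpu weights over the CPUs of a
>>> node and publish the result back per-cpu, so the I/O path still only
>>> reads a CPU-local value.
>>>
>>> #include <linux/percpu.h>
>>> #include <linux/cpumask.h>
>>> #include <linux/topology.h>
>>> #include <linux/math64.h>
>>>
>>> static void average_node_weight(u64 __percpu *weight, int node)
>>> {
>>>         const struct cpumask *mask = cpumask_of_node(node);
>>>         u64 sum = 0, avg;
>>>         unsigned int cpu, nr = 0;
>>>
>>>         for_each_cpu(cpu, mask) {
>>>                 sum += *per_cpu_ptr(weight, cpu);
>>>                 nr++;
>>>         }
>>>         if (!nr)
>>>                 return;
>>>
>>>         avg = div_u64(sum, nr);
>>>         for_each_cpu(cpu, mask)
>>>                 *per_cpu_ptr(weight, cpu) = avg;
>>> }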
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> =======
>>
>> i) Symmetric paths + system load
>> (CPU stress using cpuload):
>>
>>            per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>            (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>            -------   ------------------   --------   -------------------
>> READ:      636       621                  613        618
>> WRITE:     1832      1847                 1840       1852
>> RW:        R:872     R:869                R:866      R:874
>>            W:872     W:870                W:867      W:876
>>
>> ii) Asymmetric paths + system load
>> (CPU stress using cpuload and iperf3 traffic for inducing network
>> congestion):
>>
>>            per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>            (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>            -------   ------------------   --------   -------------------
>> READ:      553       543                  540        533
>> WRITE:     1705      1670                 1710       1655
>> RW:        R:769     R:771                R:784      R:772
>>            W:768     W:767                W:785      W:771
>>
>>
>> Looking at the above results:
>>
>> - Per-CPU vs per-CPU with I/O buckets:
>>   The per-CPU implementation already averages latency effectively
>>   across CPUs. Introducing per-CPU I/O buckets does not provide a
>>   meaningful throughput improvement; the results remain largely
>>   comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>>   Calculating or averaging weights at the NUMA level does not
>>   significantly improve throughput over per-CPU weight calculation.
>>   Across both symmetric and asymmetric scenarios, the results remain
>>   very close.
>>
>> So, based on the above results and assessment, unless there are
>> additional scenarios or metrics of interest, shall we proceed with
>> per-CPU weight calculation for this new I/O policy?
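>>
>> For reference, the hot-path side of the per-CPU approach boils down to
>> something like the sketch below (hypothetical types, not the actual
>> patch code): the submitting CPU only ever reads its own copy of each
>> path's weight, so path selection touches no shared state.
>>
>> #include <linux/percpu.h>
>> #include <linux/list.h>
>>
>> /* Hypothetical stand-ins for the real nvme path structures. */
>> struct demo_path {
>>         struct list_head entry;
>>         u64 __percpu *weight;   /* updated from the completion side */
>> };
>>
>> /* Pick the sibling whose weight, as seen by the local CPU, is highest.
>>  * The real code would of course walk the siblings list under SRCU. */
>> static struct demo_path *demo_best_path(struct list_head *siblings)
>> {
>>         struct demo_path *p, *best = NULL;
>>         u64 w, best_w = 0;
>>
>>         list_for_each_entry(p, siblings, entry) {
>>                 w = *this_cpu_ptr(p->weight);   /* CPU-local read only */
>>                 if (!best || w > best_w) {
>>                         best = p;
>>                         best_w = w;
>>                 }
>>         }
>>         return best;
>> }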
>
> I think it is counterintuitive that bucketing I/O sizes does not show
> any advantage. Don't you?
> Maybe the test is not a representative enough workload...
>
> Let's also test what happens with multiple clients against the same
> subsystem.
I am not sure that focusing on NUMA nodes will bring us an advantage
here. NUMA nodes would be an advantage if we could keep I/Os to
different controllers on different NUMA nodes; but with TCP this is
rarely possible (just think of two connections to different
controllers via the same interface ...), so I really think we
should keep the counters per-cpu.
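
To illustrate the serialization point with a purely illustrative
sketch (nothing patch-specific): a per-cpu update stays on the local
CPU, while a shared per-node counter has every CPU contending on the
same cacheline:

#include <linux/percpu.h>
#include <linux/atomic.h>

static DEFINE_PER_CPU(u64, local_lat_sum);            /* one copy per CPU */
static atomic64_t shared_lat_sum = ATOMIC64_INIT(0);  /* one shared copy  */

static void account_local(u64 lat_ns)
{
        this_cpu_add(local_lat_sum, lat_ns);    /* no cross-CPU traffic */
}

static void account_shared(u64 lat_ns)
{
        atomic64_add(lat_ns, &shared_lat_sum);  /* cacheline ping-pong
                                                   across cores and nodes */
}
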
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich