[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Hannes Reinecke
hare at suse.de
Wed Jan 7 03:15:16 PST 2026
On 1/4/26 22:06, Sagi Grimberg wrote:
>
>
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/
>>>>> `cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode.
>>>> Below is the job file I used for the test, followed by the observed
>>>> throughput results for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>>            numa           round-robin    queue-depth    adaptive
>>>>            -----------    -----------    -----------    -----------
>>>> READ:      1120 MiB/s     2241 MiB/s     2233 MiB/s     2215 MiB/s
>>>> WRITE:     1107 MiB/s     1875 MiB/s     1847 MiB/s     1892 MiB/s
>>>> RW:        R:1001 MiB/s   R:1047 MiB/s   R:1086 MiB/s   R:1112 MiB/s
>>>>            W:999 MiB/s    W:1045 MiB/s   W:1084 MiB/s   W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
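>>>>
>>>> For concreteness, the per-path averaging could be wired up roughly as
>>>> in the sketch below (illustrative only; hypothetical struct and helper
>>>> names, not the actual patch code): every completion feeds a CPU-local
>>>> EWMA of the observed latency, and the path weight is derived from it.
>>>>
>>>> #include <linux/percpu.h>
>>>> #include <linux/ktime.h>
>>>> #include <linux/math64.h>
>>>>
>>>> /* Hypothetical per-path, per-CPU state -- not the real patch structs. */
>>>> struct adaptive_stat {
>>>>         u64 ewma_lat_ns;        /* smoothed completion latency */
>>>>         u64 weight;             /* derived path weight         */
>>>> };
>>>>
>>>> /* Called on I/O completion with the measured latency of the request. */
>>>> static void adaptive_update(struct adaptive_stat __percpu *stat, u64 lat_ns)
>>>> {
>>>>         struct adaptive_stat *s = get_cpu_ptr(stat);
>>>>
>>>>         /* EWMA with 1/8 smoothing: new = 7/8 * old + 1/8 * sample */
>>>>         s->ewma_lat_ns = s->ewma_lat_ns ?
>>>>                 (s->ewma_lat_ns * 7 + lat_ns) >> 3 : lat_ns;
>>>>
>>>>         /* Lower latency => higher weight; the scale is arbitrary here. */
>>>>         s->weight = div64_u64(NSEC_PER_SEC, s->ewma_lat_ns + 1);
>>>>         put_cpu_ptr(stat);
>>>> }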
>>>>
>>>> Next I'd implement I/O size buckets and also per-NUMA-node weights,
>>>> then rerun the tests and share the results. Let's see if these changes
>>>> further improve the throughput numbers for the adaptive policy. We can
>>>> then review the results again and discuss further.
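>>>>
>>>> One way such bucketing could look (illustrative sketch only; the
>>>> thresholds and names below are placeholders, not the final code) is to
>>>> keep a separate latency average per size class, so that a 512B read
>>>> and a 256k write are not folded into the same EWMA:
>>>>
>>>> #include <linux/sizes.h>
>>>>
>>>> #define ADAPTIVE_NR_BUCKETS     4
>>>>
>>>> /* Map a request size to one of a few size classes. */
>>>> static unsigned int adaptive_io_bucket(unsigned int io_bytes)
>>>> {
>>>>         if (io_bytes <= SZ_8K)
>>>>                 return 0;       /* small  */
>>>>         if (io_bytes <= SZ_64K)
>>>>                 return 1;       /* medium */
>>>>         if (io_bytes <= SZ_256K)
>>>>                 return 2;       /* large  */
>>>>         return 3;               /* huge   */
>>>> }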
>>>>
>>>> Thanks,
>>>> --Nilay
>>> two comments:
>>> 1. I'd bias the read block-size split slightly towards small block
>>>    sizes, and the write split towards larger block sizes.
>>> 2. I'd also suggest measuring with the weight calculation averaged
>>>    over all cores of a NUMA node and then set per-cpu, so that the
>>>    datapath does not introduce serialization; a rough sketch follows
>>>    below.
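>>>
>>> Something along these lines, roughly (sketch only, invented names):
>>> periodically average one path's per-cpu weights over the CPUs of a
>>> node and publish the result back per-cpu, so the I/O path still only
>>> reads a CPU-local value.
>>>
>>> #include <linux/percpu.h>
>>> #include <linux/cpumask.h>
>>> #include <linux/topology.h>
>>> #include <linux/math64.h>
>>>
>>> static void average_node_weight(u64 __percpu *weight, int node)
>>> {
>>>         const struct cpumask *mask = cpumask_of_node(node);
>>>         u64 sum = 0, avg;
>>>         unsigned int cpu, nr = 0;
>>>
>>>         for_each_cpu(cpu, mask) {
>>>                 sum += *per_cpu_ptr(weight, cpu);
>>>                 nr++;
>>>         }
>>>         if (!nr)
>>>                 return;
>>>
>>>         avg = div_u64(sum, nr);
>>>         for_each_cpu(cpu, mask)
>>>                 *per_cpu_ptr(weight, cpu) = avg;
>>> }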
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> =======
>>
>> i) Symmetric paths + system load
>> (CPU stress using cpuload):
>>
>>            per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>            (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>            -------   ------------------   --------   -------------------
>> READ:      636       621                  613        618
>> WRITE:     1832      1847                 1840       1852
>> RW:        R:872     R:869                R:866      R:874
>>            W:872     W:870                W:867      W:876
>>
>> ii) Asymmetric paths + system load
>> (CPU stress using cpuload and iperf3 traffic for inducing network
>> congestion):
>>
>>            per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>            (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>            -------   ------------------   --------   -------------------
>> READ:      553       543                  540        533
>> WRITE:     1705      1670                 1710       1655
>> RW:        R:769     R:771                R:784      R:772
>>            W:768     W:767                W:785      W:771
>>
>>
>> Looking at the above results:
>>
>> - Per-CPU vs per-CPU with I/O buckets:
>>   The per-CPU implementation already averages latency effectively
>>   across CPUs. Introducing per-CPU I/O buckets does not provide a
>>   meaningful throughput improvement; the results remain largely
>>   comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>>   Calculating or averaging weights at the NUMA level does not
>>   significantly improve throughput over per-CPU weight calculation.
>>   Across both symmetric and asymmetric scenarios, the results remain
>>   very close.
>>
>> So, based on the above results and assessment, unless there are
>> additional scenarios or metrics of interest, shall we proceed with
>> per-CPU weight calculation for this new I/O policy?
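>>
>> For reference, the hot-path side of the per-CPU approach boils down to
>> something like the sketch below (hypothetical types, not the actual
>> patch code): the submitting CPU only ever reads its own copy of each
>> path's weight, so path selection touches no shared state.
>>
>> #include <linux/percpu.h>
>> #include <linux/list.h>
>>
>> /* Hypothetical stand-ins for the real nvme path structures. */
>> struct demo_path {
>>         struct list_head entry;
>>         u64 __percpu *weight;   /* updated from the completion side */
>> };
>>
>> /* Pick the sibling whose weight, as seen by the local CPU, is highest.
>>  * The real code would of course walk the siblings list under SRCU. */
>> static struct demo_path *demo_best_path(struct list_head *siblings)
>> {
>>         struct demo_path *p, *best = NULL;
>>         u64 w, best_w = 0;
>>
>>         list_for_each_entry(p, siblings, entry) {
>>                 w = *this_cpu_ptr(p->weight);   /* CPU-local read only */
>>                 if (!best || w > best_w) {
>>                         best = p;
>>                         best_w = w;
>>                 }
>>         }
>>         return best;
>> }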
>
> I think it is counterintuitive that bucketing I/O sizes does not show
> any advantage. Don't you?
> Maybe the test is not a representative enough workload...
>
> Let's also test what happens with multiple clients against the same
> subsystem.
I am not sure that focusing on NUMA nodes will bring us an advantage
here. NUMA nodes would be an advantage if we could keep I/Os to
different controllers on different NUMA nodes; but with TCP this is
rarely possible (just think of two connections to different
controllers via the same interface ...), so I really think we
should keep the counters per-cpu.
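
To illustrate the serialization point with a purely illustrative
sketch (nothing patch-specific): a per-cpu update stays on the local
CPU, while a shared per-node counter has every CPU contending on the
same cacheline:

#include <linux/percpu.h>
#include <linux/atomic.h>

static DEFINE_PER_CPU(u64, local_lat_sum);            /* one copy per CPU */
static atomic64_t shared_lat_sum = ATOMIC64_INIT(0);  /* one shared copy  */

static void account_local(u64 lat_ns)
{
        this_cpu_add(local_lat_sum, lat_ns);    /* no cross-CPU traffic */
}

static void account_shared(u64 lat_ns)
{
        atomic64_add(lat_ns, &shared_lat_sum);  /* cacheline ping-pong
                                                   across cores and nodes */
}
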
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich