[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Sagi Grimberg
sagi at grimberg.me
Sat Dec 27 01:33:06 PST 2025
On 26/12/2025 20:16, Nilay Shroff wrote:
>
> On 12/25/25 6:15 PM, Sagi Grimberg wrote:
>>
>> On 23/12/2025 16:50, Nilay Shroff wrote:
>>> [...]
>>>>>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that large I/O will
>>>>>> have a much lower amortized latency per 512-byte block, which could create a false bias
>>>>>> toward placing a high weight on a path if that path happened to host large I/Os, no?
>>>>>>
>>>>> Hmm, yes, good point; I think for NVMe over Fabrics this could be true.
>>>>>
>>>> Although technically we are then measuring two different things (I/O latency vs block latency). But yeah, block latency might be better
>>>> suited for the normal case; I do wonder, though, whether for high-speed
>>>> links we would see a difference, as the data transfer time is getting
>>>> really fast...
>>>>
>>> For a high-speed/high-bandwidth NIC the transfer speed would be very fast,
>>> though I think for a very large I/O size we would see a higher latency due
>>> to TCP segmentation and re-assembly.
>>>
>>> On my nvmf-tcp testbed, I do see the latency differences as shown below
>>> for varying I/O size (captured for random-read direct I/O workload):
>>> I/O-size Avg-latency(usec)
>>> 512 12113
>>> 1k 10058
>>> 2k 11246
>>> 4k 12458
>>> 8k 12189
>>> 16k 11617
>>> 32k 17686
>>> 64k 28504
>>> 128k 59013
>>> 256k 118984
>>> 512k 233428
>>> 1M 460000
>>>
>>> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
>>> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
>>> above, latency increases significantly and roughly doubles with each step in
>>> block size. Based on this data, I propose using coarse-grained I/O size buckets
>>> to preserve latency characteristics while avoiding excessive fragmentation of
>>> statistics. The suggested bucket layout is as follows:
>>>
>>> Bucket block-size-range
>>> small 512B-32k
>>> medium 32k-64k
>>> large-64k 64k-128k
>>> large-128k 128k-256k
>>> large-256k 256k-512k
>>> large-512k 512k-1M
>>> very-large >=1M
>>>
>>> In this model,
>>> - A single small bucket captures latency for I/O sizes where latency remains
>>> largely uniform.
>>> - A medium bucket captures the transition region.
>>> - Separate large buckets preserve the rapidly increasing latency behavior
>>> observed for larger block sizes.
>>> - A very-large bucket handles any I/O beyond 1M.
>>>
>>> This approach allows the adaptive policy to retain meaningful latency distinctions across
>>> I/O size regimes while keeping the number of buckets manageable and the per-bucket statistics stable.
>>> Does that make sense?
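>>>
>>> For illustration, the mapping I have in mind is roughly the following
>>> (a sketch only; the enum and helper names are made up here, and the
>>> SZ_* constants come from <linux/sizes.h>):
>>>
>>> #include <linux/sizes.h>
>>>
>>> /* Coarse-grained I/O size buckets as proposed above. */
>>> enum nvme_adp_bucket {
>>>         NVME_ADP_BUCKET_SMALL,          /* 512B - <32k  */
>>>         NVME_ADP_BUCKET_MEDIUM,         /* 32k  - <64k  */
>>>         NVME_ADP_BUCKET_LARGE_64K,      /* 64k  - <128k */
>>>         NVME_ADP_BUCKET_LARGE_128K,     /* 128k - <256k */
>>>         NVME_ADP_BUCKET_LARGE_256K,     /* 256k - <512k */
>>>         NVME_ADP_BUCKET_LARGE_512K,     /* 512k - <1M   */
>>>         NVME_ADP_BUCKET_VERY_LARGE,     /* >= 1M        */
>>>         NVME_ADP_NR_BUCKETS,
>>> };
>>>
>>> /* Map the request payload size (in bytes) to a latency bucket. */
>>> static enum nvme_adp_bucket nvme_adp_size_to_bucket(unsigned int len)
>>> {
>>>         if (len < SZ_32K)
>>>                 return NVME_ADP_BUCKET_SMALL;
>>>         if (len < SZ_64K)
>>>                 return NVME_ADP_BUCKET_MEDIUM;
>>>         if (len < SZ_128K)
>>>                 return NVME_ADP_BUCKET_LARGE_64K;
>>>         if (len < SZ_256K)
>>>                 return NVME_ADP_BUCKET_LARGE_128K;
>>>         if (len < SZ_512K)
>>>                 return NVME_ADP_BUCKET_LARGE_256K;
>>>         if (len < SZ_1M)
>>>                 return NVME_ADP_BUCKET_LARGE_512K;
>>>         return NVME_ADP_BUCKET_VERY_LARGE;
>>> }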
>> Yes
>>
>>>> [ .. ]
>>>>>>> I understand your concern about whether it really makes sense to keep this
>>>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>>>> stat per-hctx instead of per-CPU.
>>>>>>>
>>>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>>>> latency characteristics.
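>>>>>>>
>>>>>>> For illustration, the submission-side lookup could look roughly like
>>>>>>> this (a sketch only; the per-cpu adp_weight field is hypothetical, and
>>>>>>> the always-pick-the-maximum loop is a simplification of how the
>>>>>>> weights would actually be consumed):
>>>>>>>
>>>>>>> static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head)
>>>>>>> {
>>>>>>>         struct nvme_ns *ns, *best = NULL;
>>>>>>>         u32 best_weight = 0;
>>>>>>>
>>>>>>>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>>>>>                 u32 weight;
>>>>>>>
>>>>>>>                 if (nvme_path_is_disabled(ns))
>>>>>>>                         continue;
>>>>>>>
>>>>>>>                 /* weight as seen by the submitting CPU */
>>>>>>>                 weight = READ_ONCE(*this_cpu_ptr(ns->adp_weight));
>>>>>>>                 if (weight > best_weight) {
>>>>>>>                         best_weight = weight;
>>>>>>>                         best = ns;
>>>>>>>                 }
>>>>>>>         }
>>>>>>>         return best;
>>>>>>> }
>>>>>>>
>>>>>>> The point is just that a per-cpu weight read keeps the fast path cheap
>>>>>>> while naturally reflecting the locality of the submitting CPU; the real
>>>>>>> policy would presumably spread I/O in proportion to the weights rather
>>>>>>> than always taking the maximum.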
>>>>>> With this I tend to agree, but per-CPU has a lot of other churn IMO.
>>>>>> Maybe the answer is that path weights are maintained per NUMA node?
>>>>>> Then accessing these weights in the fast path is still cheap enough?
>>>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>>>> scope of what we are trying to measure, as it would largely exclude components of
>>>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>>>> actual I/O cost observed by the workload, which includes not only path and controller
>>>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>>>> preserving a true end-to-end view of path latency. Agreed?
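>>>>>
>>>>> To make the accounting side concrete, something along these lines is
>>>>> what I mean (again only a sketch with made-up names):
>>>>>
>>>>> #include <linux/percpu.h>
>>>>>
>>>>> /* Per-CPU, per-path latency state. */
>>>>> struct nvme_adp_stat {
>>>>>         u64 ewma_lat_ns;        /* smoothed completion latency */
>>>>>         u64 nr_samples;
>>>>> };
>>>>>
>>>>> /*
>>>>>  * Called on I/O completion, on whatever CPU the completion runs on,
>>>>>  * so the sample includes fabric, stack, NUMA and scheduling costs as
>>>>>  * seen from that CPU.  An EWMA weight of 1/8 keeps the update to a
>>>>>  * couple of arithmetic ops with no locking.
>>>>>  */
>>>>> static void nvme_adp_account(struct nvme_adp_stat __percpu *stats,
>>>>>                              u64 lat_ns)
>>>>> {
>>>>>         struct nvme_adp_stat *s = this_cpu_ptr(stats);
>>>>>
>>>>>         if (!s->nr_samples)
>>>>>                 s->ewma_lat_ns = lat_ns;
>>>>>         else
>>>>>                 s->ewma_lat_ns = (7 * s->ewma_lat_ns + lat_ns) >> 3;
>>>>>         s->nr_samples++;
>>>>> }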
>>>>>
>>>> Well, for fabrics you can easily have several paths connected to the same NUMA node (as in the classical 'two initiator ports cross-connected to two target ports' setup, resulting in four paths in total,
>>>> two of which will always share a NUMA node).
>>>> So that doesn't work out.
>>>>
>>>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>>>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>>>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>>>> different NVMe multipath I/O policies:
>>>>>
>>>>> numa round-robin queue-depth adaptive
>>>>> ----------- ----------- ----------- ---------
>>>>> READ: 61.1 MiB/s 87.2 MiB/s 93.1 MiB/s 107 MiB/s
>>>>> WRITE: 95.8 MiB/s 138 MiB/s 159 MiB/s 179 MiB/s
>>>>> RW: R:29.8 MiB/s R:53.1 MiB/s R:58.8 MiB/s R:66.6 MiB/s
>>>>> W:29.6 MiB/s W:52.7 MiB/s W:58.2 MiB/s W:65.9 MiB/s
>>>>>
>>>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>>>> compared against the existing policies.
>>>>>
>>>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>>>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need for additional signalling.
>>>> Talk to me about FPIN ...
>>>>
>>> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 CPUs, so fio
>>> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
>>> Below are the aggregated throughput results observed under different NVMe multipath
>>> I/O policies.
>>>
>>> i) Stressing all 32 cpus using stress-ng
>>>
>>> All 32 CPUs were stressed using:
>>> # stress-ng --cpu 0 --cpu-method all -t 60m
>>>
>>> numa round-robin queue-depth adaptive
>>> ----------- ----------- ----------- ---------
>>> READ: 159 MiB/s 193 MiB/s 215 MiB/s 255 MiB/s
>>> WRITE: 188 MiB/s 186 MiB/s 195 MiB/s 199 MiB/s
>>> RW: R:83.4 MiB/s R:101 MiB/s R:104 MiB/s R: 111 MiB/s
>>> W:83.3 MiB/s W:101 MiB/s W:105 MiB/s W: 112 MiB/s
>>>
>>> ii) Symmetric paths (No CPU stress and no induced network load):
>>>
>>> numa round-robin queue-depth adaptive
>>> ----------- ----------- ----------- ---------
>>> READ: 171 MiB/s 298 MiB/s 320 MiB/s 348 MiB/s
>>> WRITE: 229 MiB/s 419 MiB/s 442 MiB/s 460 MiB/s
>>> RW: R: 93.0 MiB/s R: 166 MiB/s R: 171 MiB/s R: 179 MiB/s
>>> W: 94.2 MiB/s W: 168 MiB/s W: 168 MiB/s W: 178 MiB/s
>>>
>>> These results show that the adaptive I/O policy consistently delivers higher
>>> throughput under CPU stress and asymmetric path conditions. In the case of symmetric
>>> paths the adaptive policy achieves throughput comparable to, or slightly
>>> better than, the existing policies.
>> I still think that accounting uncorrelated latency is the best approach here.
>>
>> My intuition tells me that:
>> 1. averaging latencies over the NUMA node
>> 2. calculating weights
>> 3. distributing the new weights per-CPU within the NUMA node
>>
>> is a better approach. It is hard to evaluate without adding some randomness.
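>>
>> Very roughly, something like this (a sketch only; nvme_adp_node_avg_latency()
>> and the per-cpu adp_weight field are hypothetical): fold the per-CPU samples
>> of each path into a per-NUMA-node average, derive a relative weight, and
>> publish it back to every CPU in the node so the fast path only ever does a
>> per-cpu read.
>>
>> static void nvme_adp_recalc_node_weights(struct nvme_ns_head *head, int node)
>> {
>>         struct nvme_ns *ns;
>>         u64 min_lat = U64_MAX;
>>         int cpu;
>>
>>         /* 1. average latencies over the NUMA node, remember the best path */
>>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>>                 u64 lat = nvme_adp_node_avg_latency(ns, node);
>>
>>                 if (lat && lat < min_lat)
>>                         min_lat = lat;
>>         }
>>
>>         /* 2. weight = best latency relative to this path's latency */
>>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>>                 u64 lat = nvme_adp_node_avg_latency(ns, node);
>>                 u32 weight = lat ? div64_u64(min_lat * 100, lat) : 100;
>>
>>                 /* 3. distribute the new weight to every CPU of the node */
>>                 for_each_cpu(cpu, cpumask_of_node(node))
>>                         WRITE_ONCE(*per_cpu_ptr(ns->adp_weight, cpu), weight);
>>         }
>> }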
>>
>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
> file I used for the test, followed by the observed throughput result for reference.
>
> Job file:
> =========
>
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> cpumode=qsort
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n2
> rw=<randread/randwrite/randrw>
> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
> iodepth=32
> numjobs=32
> direct=1
>
> Throughput:
> ===========
>
> numa round-robin queue-depth adaptive
> ----------- ----------- ----------- ---------
> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>
> When comparing the results, I did not observe a significant throughput
> difference between the queue-depth, round-robin, and adaptive policies.
> With random I/O of mixed sizes, the adaptive policy appears to average
> out the varying latency values and distribute I/O reasonably evenly
> across the active paths (assuming symmetric paths).
>
> Next I'd implement the I/O size buckets and the per-NUMA-node weights,
> then rerun the tests and share the results. Let's see if these changes help
> further improve the throughput numbers for the adaptive policy. We may then
> again review the results and discuss further.
Two comments: