[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Sagi Grimberg
sagi at grimberg.me
Thu Dec 25 04:45:04 PST 2025
On 23/12/2025 16:50, Nilay Shroff wrote:
> [...]
>>>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that a large I/O
>>>> will have a much lower amortized latency per 512-byte block, which could create a false bias
>>>> towards placing a high weight on a path if that path happened to host large I/Os, no?
>>>>
>>> Hmm, yes, good point; I think for NVMe over Fabrics this could be true.
>>>
>> Although technically we are then measuring two different things (I/O latency vs
>> block latency). But yeah, block latency might be better suited for the normal
>> case; I do wonder, though, whether for high-speed links we would still see a
>> difference, as the data transfer time is getting really fast...
>>
> For a high-speed/high-bandwidth NIC the transfer speed would be very fast,
> though I think for very large I/O sizes we would still see higher latency due
> to TCP segmentation and reassembly.
>
> On my nvmf-tcp testbed, I do see the latency differences as shown below
> for varying I/O size (captured for random-read direct I/O workload):
> I/O size    Avg latency (usec)
> 512B        12113
> 1K          10058
> 2K          11246
> 4K          12458
> 8K          12189
> 16K         11617
> 32K         17686
> 64K         28504
> 128K        59013
> 256K        118984
> 512K        233428
> 1M          460000
>
> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
> above, latency increases significantly and roughly doubles with each step in
> block size. Based on this data, I propose using coarse-grained I/O size buckets
> to preserve latency characteristics while avoiding excessive fragmentation of
> statistics. The suggested bucket layout is as follows:
>
> Bucket        Block-size range
> small         512B - <32K
> medium        32K - <64K
> large-64k     64K - <128K
> large-128k    128K - <256K
> large-256k    256K - <512K
> large-512k    512K - <1M
> very-large    >= 1M
>
> In this model,
> - A single small bucket captures latency for I/O sizes where latency remains
> largely uniform.
> - A medium bucket captures the transition region.
> - Separate large buckets preserve the rapidly increasing latency behavior
> observed for larger block sizes.
> - A very-large bucket handles any I/O beyond 1M.
>
> This approach allows the adaptive policy to retain meaningful latency distinctions
> across I/O size regimes while keeping the number of buckets manageable and
> statistically stable. Does that make sense?
Yes
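
Just so we are talking about the same thing, here is a minimal sketch of what
the size-to-bucket mapping could look like (the enum names and the helper are
made up for illustration only; the boundaries follow your table):

#include <stddef.h>

/*
 * Illustrative only: map an I/O size (in bytes) to one of the
 * coarse-grained latency buckets proposed above.  Names are invented
 * for the sketch; boundaries follow the table (512B-<32K, 32K-<64K,
 * then one bucket per power of two up to 1M, and >= 1M).
 */
enum adaptive_lat_bucket {
	ADAPTIVE_BUCKET_SMALL,      /* 512B - <32K  */
	ADAPTIVE_BUCKET_MEDIUM,     /* 32K  - <64K  */
	ADAPTIVE_BUCKET_LARGE_64K,  /* 64K  - <128K */
	ADAPTIVE_BUCKET_LARGE_128K, /* 128K - <256K */
	ADAPTIVE_BUCKET_LARGE_256K, /* 256K - <512K */
	ADAPTIVE_BUCKET_LARGE_512K, /* 512K - <1M   */
	ADAPTIVE_BUCKET_VERY_LARGE, /* >= 1M        */
	ADAPTIVE_BUCKET_NR,
};

static enum adaptive_lat_bucket adaptive_size_to_bucket(size_t bytes)
{
	if (bytes < 32 * 1024)
		return ADAPTIVE_BUCKET_SMALL;
	if (bytes < 64 * 1024)
		return ADAPTIVE_BUCKET_MEDIUM;
	if (bytes < 128 * 1024)
		return ADAPTIVE_BUCKET_LARGE_64K;
	if (bytes < 256 * 1024)
		return ADAPTIVE_BUCKET_LARGE_128K;
	if (bytes < 512 * 1024)
		return ADAPTIVE_BUCKET_LARGE_256K;
	if (bytes < 1024 * 1024)
		return ADAPTIVE_BUCKET_LARGE_512K;
	return ADAPTIVE_BUCKET_VERY_LARGE;
}

The latency samples would then be accumulated per bucket, so large I/Os no
longer skew the weights derived from small ones.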
>
>> [ .. ]
>>>>> I understand your concern about whether it really makes sense to keep this
>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>> stat per-hctx instead of per-CPU.
>>>>>
>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>> latency characteristics.
>>>> With this I tend to agree, but per-CPU has lots of other churn IMO.
>>>> Maybe the answer is that path weights are maintained per NUMA node?
>>>> Then accessing these weights in the fast path would still be cheap enough?
>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>> scope of what we are trying to measure, as it would largely exclude components of
>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>> actual I/O cost observed by the workload, which includes not only path and controller
>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>> preserving a true end-to-end view of path latency, agreed?
>>>
>> Well, for fabrics you can easily have several paths connected to the same NUMA node
>> (as in the classical setup of two initiator ports cross-connected to two target ports,
>> resulting in four paths in total, two of which will always be on the same NUMA node).
>> So that doesn't work out.
>>
>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>> host system has 32 CPUs, so iperf3 was configured with 32 parallel TCP streams,
>>> and fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>> different NVMe multipath I/O policies:
>>>
>>>              numa           round-robin    queue-depth    adaptive
>>>              ------------   ------------   ------------   ------------
>>> READ:        61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
>>> WRITE:       95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
>>> RW:          R:29.8 MiB/s   R:53.1 MiB/s   R:58.8 MiB/s   R:66.6 MiB/s
>>>              W:29.6 MiB/s   W:52.7 MiB/s   W:58.2 MiB/s   W:65.9 MiB/s
>>>
>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>> compared against the existing policies.
>>>
>> And that is probably the best argument; we should put it under stress in various scenarios.
>> I must admit I am _really_ in favour of this iopolicy, as it would be able to handle any
>> temporary issues on the fabric (or backend) without the need for additional signalling.
>> Talk to me about FPIN ...
>>
> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 CPUs, so fio
> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
> Below are the aggregated throughput results observed under different NVMe multipath
> I/O policies.
>
> i) Stressing all 32 CPUs using stress-ng
>
> All 32 CPUs were stressed using:
> # stress-ng --cpu 0 --cpu-method all -t 60m
>
>              numa           round-robin    queue-depth    adaptive
>              ------------   ------------   ------------   ------------
> READ:        159 MiB/s      193 MiB/s      215 MiB/s      255 MiB/s
> WRITE:       188 MiB/s      186 MiB/s      195 MiB/s      199 MiB/s
> RW:          R:83.4 MiB/s   R:101 MiB/s    R:104 MiB/s    R:111 MiB/s
>              W:83.3 MiB/s   W:101 MiB/s    W:105 MiB/s    W:112 MiB/s
>
> ii) Symmetric paths (No CPU stress and no induced network load):
>
>              numa           round-robin    queue-depth    adaptive
>              ------------   ------------   ------------   ------------
> READ:        171 MiB/s      298 MiB/s      320 MiB/s      348 MiB/s
> WRITE:       229 MiB/s      419 MiB/s      442 MiB/s      460 MiB/s
> RW:          R:93.0 MiB/s   R:166 MiB/s    R:171 MiB/s    R:179 MiB/s
>              W:94.2 MiB/s   W:168 MiB/s    W:168 MiB/s    W:178 MiB/s
>
> These results show that the adaptive I/O policy consistently delivers higher
> throughput under CPU stress and asymmetric path conditions. In the case of
> symmetric paths, the adaptive policy achieves throughput comparable to, or
> slightly better than, the existing policies.
I still think that accounting latency in a way that is not tied to the
individual submitting CPU is the best approach here.
My intuition tells me that:
1. averaging latencies over the NUMA node,
2. calculating weights from those averages, and
3. distributing the new weights per-CPU within the NUMA node
is a better approach. It is hard to evaluate without adding some randomness.
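
Roughly along these lines, as a sketch only (all names, array sizes and the
weight formula are made up, and per-cpu/locking details are ignored):

#include <stdint.h>

/*
 * Sketch only: completions update per-CPU latency counters; a periodic
 * worker folds them into a per-NUMA-node average per path, derives one
 * weight per path, and publishes it back to every CPU of the node so
 * the submission fast path only reads a per-CPU value.
 */
#define MAX_CPUS	64
#define MAX_PATHS	8

struct path_stats {
	uint64_t lat_sum_ns[MAX_PATHS];	/* summed completion latency */
	uint64_t nr_samples[MAX_PATHS];	/* number of completions */
};

static struct path_stats cpu_stats[MAX_CPUS];	 /* updated on completion */
static uint32_t cpu_weight[MAX_CPUS][MAX_PATHS]; /* read on submission */

static void adaptive_update_node(const int *node_cpus, int nr_cpus,
				 int nr_paths)
{
	uint64_t avg_ns[MAX_PATHS] = { 0 };
	uint32_t weight[MAX_PATHS];
	int c, p;

	/* 1) average latencies over all CPUs of this NUMA node */
	for (p = 0; p < nr_paths; p++) {
		uint64_t sum = 0, nr = 0;

		for (c = 0; c < nr_cpus; c++) {
			struct path_stats *ps = &cpu_stats[node_cpus[c]];

			sum += ps->lat_sum_ns[p];
			nr += ps->nr_samples[p];
			/* reset for the next sampling interval */
			ps->lat_sum_ns[p] = 0;
			ps->nr_samples[p] = 0;
		}
		avg_ns[p] = nr ? sum / nr : 0;
	}

	/*
	 * 2) derive weights: lower average latency => higher weight.
	 * Simple inverse scaling here; the real policy would likely
	 * use an EWMA and proper normalization.
	 */
	for (p = 0; p < nr_paths; p++)
		weight[p] = avg_ns[p] ?
			(uint32_t)(1000000000ULL / avg_ns[p]) : 0;

	/* 3) distribute the new weights to every CPU in the node */
	for (c = 0; c < nr_cpus; c++)
		for (p = 0; p < nr_paths; p++)
			if (weight[p]) /* keep old weight if no samples */
				cpu_weight[node_cpus[c]][p] = weight[p];
}

The fast path then only dereferences cpu_weight[] for its own CPU, while the
averaging and weight calculation run off the hot path.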
Can you please run benchmarks with
`blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode`?
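
For example, a job file along these lines (illustrative only; the device
path, runtime and the split/load values are arbitrary placeholders):

[global]
direct=1
time_based=1
runtime=300
group_reporting

# I/O job: mixed block sizes to exercise the different size buckets
[io]
ioengine=io_uring
# placeholder device path
filename=/dev/nvme0n1
rw=randread
numjobs=32
iodepth=32
# e.g. 50% 4k, 30% 64k, 20% 1m; alternatively bsrange=512-1m
bssplit=4k/50:64k/30:1m/20

# CPU load job: burns cycles independently of the I/O path
[cpuburn]
ioengine=cpuio
cpuload=80
cpuchunks=50000
cpumode=qsort
numjobs=32

The bssplit (or bsrange) job exercises the size buckets, while the parallel
cpuio job adds CPU load that is independent of the I/O path, which should
provide the randomness mentioned above.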