[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Sagi Grimberg
sagi at grimberg.me
Thu Dec 25 04:45:04 PST 2025
On 23/12/2025 16:50, Nilay Shroff wrote:
> [...]
>>>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that a large I/O
>>>> will have a much lower amortized latency per 512-byte block, which could create a false bias
>>>> towards placing a high weight on a path if that path happened to host large I/Os, no?
>>>>
>>> Hmm, yes, good point; I think for NVMe over Fabrics this could be true.
>>>
>> Although technically we are then measuring two different things (I/O latency vs
>> block latency). But yeah, block latency might be better suited for the normal
>> case; I do wonder, though, whether for high-speed links we would still see a
>> difference, as the data transfer time is getting really fast...
>>
> For a high-speed/high-bandwidth NIC the transfer speed would be very fast,
> though I think for very large I/O sizes we would still see higher latency due
> to TCP segmentation and reassembly.
>
> On my nvmf-tcp testbed, I do see the latency differences as shown below
> for varying I/O size (captured for random-read direct I/O workload):
> I/O size    Avg latency (usec)
> 512B        12113
> 1K          10058
> 2K          11246
> 4K          12458
> 8K          12189
> 16K         11617
> 32K         17686
> 64K         28504
> 128K        59013
> 256K        118984
> 512K        233428
> 1M          460000
>
> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
> above, latency increases significantly and roughly doubles with each step in
> block size. Based on this data, I propose using coarse-grained I/O size buckets
> to preserve latency characteristics while avoiding excessive fragmentation of
> statistics. The suggested bucket layout is as follows:
>
> Bucket        Block-size range
> small         512B - <32K
> medium        32K - <64K
> large-64k     64K - <128K
> large-128k    128K - <256K
> large-256k    256K - <512K
> large-512k    512K - <1M
> very-large    >= 1M
>
> In this model,
> - A single small bucket captures latency for I/O sizes where latency remains
> largely uniform.
> - A medium bucket captures the transition region.
> - Separate large buckets preserve the rapidly increasing latency behavior
> observed for larger block sizes.
> - A very-large bucket handles any I/O beyond 1M.
>
> This approach allows the adaptive policy to retain meaningful latency distinctions
> across I/O size regimes while keeping the number of buckets manageable and
> statistically stable. Does that make sense?
Yes
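
Just so we are talking about the same thing, here is a minimal sketch of what
the size-to-bucket mapping could look like (the enum names and the helper are
made up for illustration only; the boundaries follow your table):

#include <stddef.h>

/*
 * Illustrative only: map an I/O size (in bytes) to one of the
 * coarse-grained latency buckets proposed above.  Names are invented
 * for the sketch; boundaries follow the table (512B-<32K, 32K-<64K,
 * then one bucket per power of two up to 1M, and >= 1M).
 */
enum adaptive_lat_bucket {
	ADAPTIVE_BUCKET_SMALL,      /* 512B - <32K  */
	ADAPTIVE_BUCKET_MEDIUM,     /* 32K  - <64K  */
	ADAPTIVE_BUCKET_LARGE_64K,  /* 64K  - <128K */
	ADAPTIVE_BUCKET_LARGE_128K, /* 128K - <256K */
	ADAPTIVE_BUCKET_LARGE_256K, /* 256K - <512K */
	ADAPTIVE_BUCKET_LARGE_512K, /* 512K - <1M   */
	ADAPTIVE_BUCKET_VERY_LARGE, /* >= 1M        */
	ADAPTIVE_BUCKET_NR,
};

static enum adaptive_lat_bucket adaptive_size_to_bucket(size_t bytes)
{
	if (bytes < 32 * 1024)
		return ADAPTIVE_BUCKET_SMALL;
	if (bytes < 64 * 1024)
		return ADAPTIVE_BUCKET_MEDIUM;
	if (bytes < 128 * 1024)
		return ADAPTIVE_BUCKET_LARGE_64K;
	if (bytes < 256 * 1024)
		return ADAPTIVE_BUCKET_LARGE_128K;
	if (bytes < 512 * 1024)
		return ADAPTIVE_BUCKET_LARGE_256K;
	if (bytes < 1024 * 1024)
		return ADAPTIVE_BUCKET_LARGE_512K;
	return ADAPTIVE_BUCKET_VERY_LARGE;
}

The latency samples would then be accumulated per bucket, so large I/Os no
longer skew the weights derived from small ones.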
>
>> [ .. ]
>>>>> I understand your concern about whether it really makes sense to keep this
>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>> stat per-hctx instead of per-CPU.
>>>>>
>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>> latency characteristics.
>>>> With this I tend to agree, but per-CPU has lots of other churn IMO.
>>>> Maybe the answer is that path weights are maintained per NUMA node?
>>>> Then accessing these weights in the fast path would still be cheap enough?
>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>> scope of what we are trying to measure, as it would largely exclude components of
>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>> actual I/O cost observed by the workload, which includes not only path and controller
>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>> preserving a true end-to-end view of path latency, agreed?
>>>
>> Well, for fabrics you can easily have several paths connected to the same NUMA node
>> (as in the classical setup of two initiator ports cross-connected to two target ports,
>> resulting in four paths in total, two of which will always be on the same NUMA node).
>> So that doesn't work out.
>>
>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>> host system has 32 CPUs, so iperf3 was configured with 32 parallel TCP streams,
>>> and fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>> different NVMe multipath I/O policies:
>>>
>>>              numa           round-robin    queue-depth    adaptive
>>>              ------------   ------------   ------------   ------------
>>> READ:        61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
>>> WRITE:       95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
>>> RW:          R:29.8 MiB/s   R:53.1 MiB/s   R:58.8 MiB/s   R:66.6 MiB/s
>>>              W:29.6 MiB/s   W:52.7 MiB/s   W:58.2 MiB/s   W:65.9 MiB/s
>>>
>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>> compared against the existing policies.
>>>
>> And that is probably the best argument; we should put it under stress in various scenarios.
>> I must admit I am _really_ in favour of this iopolicy, as it would be able to handle any
>> temporary issues on the fabric (or backend) without the need for additional signalling.
>> Talk to me about FPIN ...
>>
> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 CPUs, so fio
> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
> Below are the aggregated throughput results observed under different NVMe multipath
> I/O policies.
>
> i) Stressing all 32 CPUs using stress-ng
>
> All 32 CPUs were stressed using:
> # stress-ng --cpu 0 --cpu-method all -t 60m
>
>              numa           round-robin    queue-depth    adaptive
>              ------------   ------------   ------------   ------------
> READ:        159 MiB/s      193 MiB/s      215 MiB/s      255 MiB/s
> WRITE:       188 MiB/s      186 MiB/s      195 MiB/s      199 MiB/s
> RW:          R:83.4 MiB/s   R:101 MiB/s    R:104 MiB/s    R:111 MiB/s
>              W:83.3 MiB/s   W:101 MiB/s    W:105 MiB/s    W:112 MiB/s
>
> ii) Symmetric paths (No CPU stress and no induced network load):
>
>              numa           round-robin    queue-depth    adaptive
>              ------------   ------------   ------------   ------------
> READ:        171 MiB/s      298 MiB/s      320 MiB/s      348 MiB/s
> WRITE:       229 MiB/s      419 MiB/s      442 MiB/s      460 MiB/s
> RW:          R:93.0 MiB/s   R:166 MiB/s    R:171 MiB/s    R:179 MiB/s
>              W:94.2 MiB/s   W:168 MiB/s    W:168 MiB/s    W:178 MiB/s
>
> These results show that the adaptive I/O policy consistently delivers higher
> throughput under CPU stress and asymmetric path conditions. In the case of
> symmetric paths, the adaptive policy achieves throughput comparable to, or
> slightly better than, the existing policies.
I still think that accounting latency in a way that is not tied to the
individual submitting CPU is the best approach here.
My intuition tells me that:
1. averaging latencies over the NUMA node,
2. calculating weights from those averages, and
3. distributing the new weights per-CPU within the NUMA node
is a better approach. It is hard to evaluate without adding some randomness.
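
Roughly along these lines, as a sketch only (all names, array sizes and the
weight formula are made up, and per-cpu/locking details are ignored):

#include <stdint.h>

/*
 * Sketch only: completions update per-CPU latency counters; a periodic
 * worker folds them into a per-NUMA-node average per path, derives one
 * weight per path, and publishes it back to every CPU of the node so
 * the submission fast path only reads a per-CPU value.
 */
#define MAX_CPUS	64
#define MAX_PATHS	8

struct path_stats {
	uint64_t lat_sum_ns[MAX_PATHS];	/* summed completion latency */
	uint64_t nr_samples[MAX_PATHS];	/* number of completions */
};

static struct path_stats cpu_stats[MAX_CPUS];	 /* updated on completion */
static uint32_t cpu_weight[MAX_CPUS][MAX_PATHS]; /* read on submission */

static void adaptive_update_node(const int *node_cpus, int nr_cpus,
				 int nr_paths)
{
	uint64_t avg_ns[MAX_PATHS] = { 0 };
	uint32_t weight[MAX_PATHS];
	int c, p;

	/* 1) average latencies over all CPUs of this NUMA node */
	for (p = 0; p < nr_paths; p++) {
		uint64_t sum = 0, nr = 0;

		for (c = 0; c < nr_cpus; c++) {
			struct path_stats *ps = &cpu_stats[node_cpus[c]];

			sum += ps->lat_sum_ns[p];
			nr += ps->nr_samples[p];
			/* reset for the next sampling interval */
			ps->lat_sum_ns[p] = 0;
			ps->nr_samples[p] = 0;
		}
		avg_ns[p] = nr ? sum / nr : 0;
	}

	/*
	 * 2) derive weights: lower average latency => higher weight.
	 * Simple inverse scaling here; the real policy would likely
	 * use an EWMA and proper normalization.
	 */
	for (p = 0; p < nr_paths; p++)
		weight[p] = avg_ns[p] ?
			(uint32_t)(1000000000ULL / avg_ns[p]) : 0;

	/* 3) distribute the new weights to every CPU in the node */
	for (c = 0; c < nr_cpus; c++)
		for (p = 0; p < nr_paths; p++)
			if (weight[p]) /* keep old weight if no samples */
				cpu_weight[node_cpus[c]][p] = weight[p];
}

The fast path then only dereferences cpu_weight[] for its own CPU, while the
averaging and weight calculation run off the hot path.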
Can you please run benchmarks with
`blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode`?
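
For example, a job file along these lines (illustrative only; the device
path, runtime and the split/load values are arbitrary placeholders):

[global]
direct=1
time_based=1
runtime=300
group_reporting

# I/O job: mixed block sizes to exercise the different size buckets
[io]
ioengine=io_uring
# placeholder device path
filename=/dev/nvme0n1
rw=randread
numjobs=32
iodepth=32
# e.g. 50% 4k, 30% 64k, 20% 1m; alternatively bsrange=512-1m
bssplit=4k/50:64k/30:1m/20

# CPU load job: burns cycles independently of the I/O path
[cpuburn]
ioengine=cpuio
cpuload=80
cpuchunks=50000
cpumode=qsort
numjobs=32

The bssplit (or bsrange) job exercises the size buckets, while the parallel
cpuio job adds CPU load that is independent of the I/O path, which should
provide the randomness mentioned above.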