[RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Tue Dec 23 06:50:32 PST 2025
[...]
>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>> have much lower amortized latency per 512 block, which could create a false bias
>>> to place a high weight on a path, if that path happened to host large I/Os, no?
>>>
>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>
> Although technically we are then measuring two different things (IO latency vs block latency). But yeah, block latency might be better
> suited for the normal case; I do wonder, though, if for high-speed
> links we do see a difference as the data transfer time is getting
> really fast...
>
For a high-speed/high-bandwidth NIC the transfer time itself would be very short;
however, for very large I/O sizes I think we would still see higher latency due
to TCP segmentation and re-assembly.
On my nvmf-tcp testbed, I do see the latency differences shown below for
varying I/O sizes (captured for a random-read, direct-I/O workload):
I/O size        Avg latency (usec)
--------        ------------------
512B                         12113
1K                           10058
2K                           11246
4K                           12458
8K                           12189
16K                          11617
32K                          17686
64K                          28504
128K                         59013
256K                        118984
512K                        233428
1M                          460000
As can be seen, for smaller block sizes (512B–16K) latency remains relatively
stable in the ~10–12 ms range. Starting at 32K, and more noticeably at 64K and
above, latency increases significantly and roughly doubles with each step in
block size. (This also illustrates the bias concern above: normalizing to 512B
sectors would credit a 1M I/O with roughly 460000/2048 ≈ 225 usec per sector,
versus ~12 ms for an actual 512B I/O.) Based on this data, I propose using
coarse-grained I/O size buckets to preserve the latency characteristics while
avoiding excessive fragmentation of the statistics. The suggested bucket layout
is as follows:
Bucket          Block-size range
------          ----------------
small           512B - <32K
medium          32K  - <64K
large-64k       64K  - <128K
large-128k      128K - <256K
large-256k      256K - <512K
large-512k      512K - <1M
very-large      >= 1M
In this model,
- A single small bucket captures latency for I/O sizes where latency remains
largely uniform.
- A medium bucket captures the transition region.
- Separate large buckets preserve the rapidly increasing latency behavior
observed for larger block sizes.
- A very-large bucket handles any I/O beyond 1M.
This approach allows the adaptive policy to retain meaningful latency distinctions across
I/O size regimes while keeping the number of buckets manageable and statistically stable.
Does that make sense?
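
To make the bucketing concrete, here is a minimal sketch of how the I/O-size
to bucket mapping could look (the enum, the helper name and the use of ilog2()
are hypothetical illustrations only, not the actual patch code):

#include <linux/log2.h>
#include <linux/sizes.h>

/* Hypothetical bucket layout matching the table above. */
enum nvme_adaptive_bucket {
	NVME_BUCKET_SMALL,	/* 512B - <32K  */
	NVME_BUCKET_MEDIUM,	/* 32K  - <64K  */
	NVME_BUCKET_LARGE_64K,	/* 64K  - <128K */
	NVME_BUCKET_LARGE_128K,	/* 128K - <256K */
	NVME_BUCKET_LARGE_256K,	/* 256K - <512K */
	NVME_BUCKET_LARGE_512K,	/* 512K - <1M   */
	NVME_BUCKET_VERY_LARGE,	/* >= 1M        */
	NVME_BUCKET_COUNT,
};

static enum nvme_adaptive_bucket nvme_adaptive_io_bucket(unsigned int io_bytes)
{
	if (io_bytes < SZ_32K)
		return NVME_BUCKET_SMALL;
	if (io_bytes < SZ_64K)
		return NVME_BUCKET_MEDIUM;
	if (io_bytes >= SZ_1M)
		return NVME_BUCKET_VERY_LARGE;
	/* 64K..<1M: one bucket per power-of-two step. */
	return NVME_BUCKET_LARGE_64K + (ilog2(io_bytes) - ilog2(SZ_64K));
}

The mapping needs no per-request state and only a couple of branches, so it
should be cheap enough to run on every completion.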
> [ .. ]
>>>> I understand your concern about whether it really makes sense to keep this
>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>> stat per-hctx instead of per-CPU.
>>>>
>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>> latency characteristics.
>>>
>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>> then accessing these weights in the fast-path is still cheap enough?
>>
>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>> scope of what we are trying to measure, as it would largely exclude components of
>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>> actual I/O cost observed by the workload, which includes not only path and controller
>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>> preserving a true end-to-end view of path latency, agreed?
>>
> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
> But two of these paths will always be on the same NUMA node).
> So that doesn't work out.
>
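To keep the per-CPU proposal concrete as well, here is a rough sketch of what
the per-CPU, per-bucket bookkeeping and the fast-path lookup could look like.
It reuses the hypothetical bucket helper sketched above; the structure layout,
the names and the issue_cpu parameter are illustrative only, not the actual
patch code:

#include <linux/percpu.h>
#include <linux/limits.h>

/* Hypothetical per-CPU latency accumulator attached to each path. */
struct nvme_adaptive_stat {
	u64	ewma_lat_ns[NVME_BUCKET_COUNT];
	u64	nr_samples[NVME_BUCKET_COUNT];
};

struct nvme_adaptive_path {
	struct nvme_adaptive_stat __percpu *stat;	/* from alloc_percpu() */
};

/*
 * On completion: fold the observed latency into the EWMA kept for the CPU
 * that issued the request (saved at submission time), so the numbers reflect
 * the end-to-end cost as seen from that CPU.
 */
static void nvme_adaptive_account(struct nvme_adaptive_path *p, int issue_cpu,
				  unsigned int io_bytes, u64 lat_ns)
{
	struct nvme_adaptive_stat *s = per_cpu_ptr(p->stat, issue_cpu);
	int b = nvme_adaptive_io_bucket(io_bytes);

	if (s->nr_samples[b]++)
		s->ewma_lat_ns[b] = s->ewma_lat_ns[b] -
				    (s->ewma_lat_ns[b] >> 3) + (lat_ns >> 3);
	else
		s->ewma_lat_ns[b] = lat_ns;	/* seed with the first sample */
}

/*
 * On submission: the issuing CPU consults only its own view of each path,
 * so the fast path reads one per-CPU cacheline per path and nothing else.
 */
static struct nvme_adaptive_path *
nvme_adaptive_select(struct nvme_adaptive_path **paths, int npaths,
		     unsigned int io_bytes)
{
	int b = nvme_adaptive_io_bucket(io_bytes);
	struct nvme_adaptive_path *best = NULL;
	u64 best_lat = U64_MAX;
	int i, cpu = get_cpu();

	for (i = 0; i < npaths; i++) {
		u64 lat = per_cpu_ptr(paths[i]->stat, cpu)->ewma_lat_ns[b];

		if (!lat) {		/* unsampled path: probe it first */
			best = paths[i];
			break;
		}
		if (lat < best_lat) {
			best_lat = lat;
			best = paths[i];
		}
	}
	put_cpu();
	return best;
}

The intent is that the per-CPU weights stay cheap to consult in the fast path
while naturally reflecting NUMA locality and scheduling effects as seen by the
issuing CPU.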
>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>> host system has 32 CPUs, so iperf3 was configured with 32 parallel TCP streams.
>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>> ioengine=io_uring. Below are the aggregated throughput results observed under
>> different NVMe multipath I/O policies:
>>
>>             numa           round-robin    queue-depth    adaptive
>>             -----------    -----------    -----------    -----------
>> READ:       61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
>> WRITE:      95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
>> RW:         R: 29.8 MiB/s  R: 53.1 MiB/s  R: 58.8 MiB/s  R: 66.6 MiB/s
>>             W: 29.6 MiB/s  W: 52.7 MiB/s  W: 58.2 MiB/s  W: 65.9 MiB/s
>>
>> These results show that under combined CPU and network stress, the adaptive I/O policy
>> consistently delivers higher throughput across read, write, and mixed workloads when
>> compared against the existing policies.
>>
> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
> Talk to me about FPIN ...
>
I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 CPUs, so fio
was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
Below are the aggregated throughput results observed under the different NVMe multipath
I/O policies.
i) All 32 CPUs stressed using stress-ng:

# stress-ng --cpu 0 --cpu-method all -t 60m

            numa           round-robin    queue-depth    adaptive
            -----------    -----------    -----------    -----------
READ:       159 MiB/s      193 MiB/s      215 MiB/s      255 MiB/s
WRITE:      188 MiB/s      186 MiB/s      195 MiB/s      199 MiB/s
RW:         R: 83.4 MiB/s  R: 101 MiB/s   R: 104 MiB/s   R: 111 MiB/s
            W: 83.3 MiB/s  W: 101 MiB/s   W: 105 MiB/s   W: 112 MiB/s
ii) Symmetric paths (no CPU stress and no induced network load):

            numa           round-robin    queue-depth    adaptive
            -----------    -----------    -----------    -----------
READ:       171 MiB/s      298 MiB/s      320 MiB/s      348 MiB/s
WRITE:      229 MiB/s      419 MiB/s      442 MiB/s      460 MiB/s
RW:         R: 93.0 MiB/s  R: 166 MiB/s   R: 171 MiB/s   R: 179 MiB/s
            W: 94.2 MiB/s  W: 168 MiB/s   W: 168 MiB/s   W: 178 MiB/s
These results show that the adaptive I/O policy consistently delivers higher
throughput under CPU stress and asymmetric path conditions, while in the
symmetric-path case it achieves throughput comparable to, or slightly better
than, the existing policies.
Thanks,
--Nilay