[RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy
Hannes Reinecke
hare at suse.de
Tue Sep 23 00:03:11 PDT 2025
On 9/23/25 05:43, Nilay Shroff wrote:
>
>
> On 9/22/25 1:00 PM, Hannes Reinecke wrote:
>> On 9/21/25 13:12, Nilay Shroff wrote:
[ .. ]
>>> +	srcu_idx = srcu_read_lock(&head->srcu);
>>> +	list_for_each_entry_srcu(cur_ns, &head->list, siblings,
>>> +			srcu_read_lock_held(&head->srcu)) {
>>
>> And this is even more awkward as we need to iterate over all paths
>> (during completion!).
>>
> Hmm yes, but we only iterate once every ~15 seconds per CPU, so the overhead is minimal.
> Typically we don’t have a large number of paths to deal with: enterprise SSDs usually
> expose at most two controllers, and even in fabrics setups the path count is usually
> limited to around 4–6. So the loop should run quite fast.
Hmm. Not from my experience. There is at least one implementation from a
rather substantial array vendor exposing up to low hundreds of queues.
> Also, looping in itself isn’t unusual — for example, the queue-depth I/O policy already
> iterates over all paths in the submission path to check queue depth before dispatching each
> I/O. That said, if looping in the completion path is still a concern, we could consider
> moving this into a dedicated worker thread instead. What do you think?
>
Not sure if that's a good idea; either the worker thread runs
asynchronously to the completion, and then we have to deal with
reliably adding up the numbers, or it runs synchronously and we lose
performance.
I still think that _not_ iterating and just adding up single-CPU
latencies might be worthwhile.
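Roughly along these lines (untested sketch only; the 'adaptive_path_stat'
struct and the 'ns->adaptive_stat' per-cpu pointer are invented here for
illustration, not taken from your patch, and preemption/locking details
are glossed over):

struct adaptive_path_stat {
	u64	lat_acc;	/* latency accumulated on this CPU */
	u64	nr_samples;
	u64	ewma;		/* per-CPU, per-path EWMA */
};

/* completion side: runs on whatever CPU the completion lands on */
static void adaptive_account_completion(struct nvme_ns *ns, u64 lat_ns)
{
	struct adaptive_path_stat *st = this_cpu_ptr(ns->adaptive_stat);

	st->lat_acc += lat_ns;
	st->nr_samples++;
}

/* submission side: fold the accumulator into the EWMA once per epoch */
static void adaptive_update_ewma(struct nvme_ns *ns)
{
	struct adaptive_path_stat *st = this_cpu_ptr(ns->adaptive_stat);
	u64 avg;

	if (!st->nr_samples)
		return;
	avg = div64_u64(st->lat_acc, st->nr_samples);
	/* EWMA weight of 1/8 picked arbitrarily for the example */
	st->ewma = st->ewma ? st->ewma - (st->ewma >> 3) + (avg >> 3) : avg;
	st->lat_acc = 0;
	st->nr_samples = 0;
}

That keeps the completion hot path down to two additions, and each
submitting CPU only ever looks at its own numbers.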
>> Do we really need to do this?
>> What would happen if we just measure the latency on the local CPU
>> and do away with this loop?
>> We would have fewer samples, true, but we would be able to
>> differentiate not only between distinct path latencies but also between
>> different CPU latencies; I would think this is a bonus for
>> multi-socket machines.
>>
> The idea is to keep per-cpu view consistent for each path. As we know,
> in NVMe/fabrics multipath, submission and completion CPUs don’t necessarily
> match (depends on the host’s irq/core mapping). And so if we were to measure
> the latency/EWMA locally per-cpu then the per-CPU accumulator might be biased
> towards the completion CPU, not the submission CPU. For instance, if submission
> is on CPU A but completion lands on CPU B, then CPU A’s weights never reflect
> its own I/O experience — they’ll be skewed by how interrupts get steered.
>
True. The problem is that for #CPUs > #queues we're setting up a CPU
affinity group, and interrupts are directed to one of the CPUs in that
group. I had hoped that the blk-mq code would raise a softirq in that
case and call .end_request on the CPU registered in the request itself.
That probably needs to be evaluated.
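For reference, the steering I have in mind is the rq_affinity handling in
blk-mq; conceptually it boils down to something like the check below (a
simplification of what blk-mq does internally, not the actual code, and
the QUEUE_FLAG_SAME_FORCE / shared-cache-domain cases are glossed over):

/*
 * If rq_affinity is enabled and the interrupt landed on a different CPU
 * than the one that submitted the request, blk-mq bounces the completion
 * (via IPI/softirq) back to rq->mq_ctx->cpu before calling into the
 * driver's completion handler.
 */
static bool completion_needs_bounce(struct request *rq)
{
	if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
		return false;
	return rq->mq_ctx->cpu != raw_smp_processor_id();
}

If that also holds for the #CPUs > #vectors case, the per-CPU view
wouldn't be skewed towards the completion CPU in the first place.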
> So on multi-socket/NUMA systems, depending on topology, calculating local
> per-CPU EWMA/latency may or may not line up. For example:
>
> - If we have #CPUs <= #vectors supported by the NVMe disk then typically
> we have a 1:1 mapping between submission and completion queues, and hence all completions for
> a queue are steered to the same CPU that also submits; per-CPU stats are then accurate.
>
> - But when #CPUs > #vectors, completions may be centralized or spread differently. In that
> case, the per-CPU latency view can be distorted — e.g., CPU A may submit, but CPU B takes
> completions, so CPU A’s weights never reflect its own I/O behavior.
>
See above. We might check whether blk-mq doesn't already cover this case.
Thing is, I actually _do_ want to measure per-CPU latency.
On a multi-socket system it really does matter whether an I/O is run
from a CPU on the socket attached to the PCI device or from an
off-socket CPU. If we calculate just the per-path latency
we completely miss that (as blk-mq will spread I/O across _all_
CPUs), but if we measure per-CPU latency we end up
with a differential matrix where the CPUs with the lowest latency
are preferred.
So if we have a system with two sockets and two PCI HBAs, each
connected to a different socket, using per-path latency will spread
I/Os across all CPUs, whereas using per-CPU latency will direct
I/Os to the CPUs with the lowest latency, preferring the local CPUs.
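Path selection then just becomes "pick the sibling with the lowest EWMA
as seen from the local CPU", along the lines of the sketch below (again
only illustrative, reusing the invented adaptive_path_stat from above;
ANA and controller state checks are omitted for brevity):

static struct nvme_ns *adaptive_select_path(struct nvme_ns_head *head)
{
	struct nvme_ns *ns, *best = NULL;
	u64 best_ewma = U64_MAX;

	list_for_each_entry_srcu(ns, &head->list, siblings,
				 srcu_read_lock_held(&head->srcu)) {
		struct adaptive_path_stat *st = this_cpu_ptr(ns->adaptive_stat);

		if (!st->ewma) {
			/* no samples from this CPU yet, give the path a go */
			return ns;
		}
		if (st->ewma < best_ewma) {
			best_ewma = st->ewma;
			best = ns;
		}
	}
	return best;
}

In the two-socket/two-HBA example the off-socket path ends up with a
consistently higher EWMA on the remote CPUs, so I/O sticks to the
locally attached HBA without any explicit topology awareness.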
>> _And_ we wouldn't need to worry about path failures, which is bound
>> to expose some race conditions if we need to iterate paths at the
>> same time as path failures are being handled.
>>
> Yes, agreed, we may have some race here and so the path score/weight may be
> skewed when that happens, but then that'd be auto-corrected in the next epoch
> (after ~15 sec) when we re-calculate the path weight/score again, wouldn't it?
>
Let's see. I still would want to check if we can't do per-cpu
statistics, as that would automatically avoid any races :-)
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich