[PATCH v7 0/6] nvme-fc: FPIN link integrity handling
John Meneghini
jmeneghi at redhat.com
Tue Jul 8 12:56:13 PDT 2025
On 7/2/25 2:10 AM, Hannes Reinecke wrote:
>> During path fail testing on the numa iopolicy, I found that I/O moves
>> off of the marginal path after a first link integrity event is
>> received, but if the non-marginal path the I/O is on is disconnected,
>> the I/O is transferred onto a marginal path (in testing, sometimes
>> I've seen it go to a "marginal optimized" path, and sometimes
>> "marginal non-optimized").
>>
> That is by design.
> 'marginal' paths are only evaluated for the 'optimized' path selection,
> where it's obvious that 'marginal' paths should not be selected as
> 'optimized'.
I think we might want to change this. With the NUMA scheduler you can end up
using a non-optimized marginal path. This happens when we test with 4 paths
(2 optimized and 2 non-optimized) and set all 4 paths to marginal. In this case
the NUMA scheduler should simply choose the optimized marginal path on the
closest NUMA node. However, that's not what happens: it consistently chooses
the first non-optimized path.
> For 'non-optimized' the situation is less clear; is the 'non-optimized'
> path preferable to 'marginal'? Or the other way round?
> So once the 'optimized' path selection returns no paths, _any_ of the
> remaining paths are eligible.
This is a good question for Broadcom. I think, with all IO schedulers, as long
as there is a non-marginal path available, that path should be used. So a
non-marginal non-optimized path should take precedence over a marginal
optimized path. In the case where all paths are marginal, I think the scheduler
should simply use the first optimized path on the closest NUMA node.
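To make that precedence concrete, here is a small stand-alone C model
(illustrative only; this is not the in-kernel nvme multipath code, and all
struct and function names are made up): any non-marginal path beats any
marginal path, ANA state breaks ties within each group, and NUMA distance
breaks ties after that.

/*
 * Illustrative model of the proposed precedence, not kernel code:
 * non-marginal optimized > non-marginal non-optimized >
 * marginal optimized > marginal non-optimized, with NUMA distance
 * as the final tie-breaker.
 */
#include <stdio.h>
#include <stdbool.h>

enum ana_state { ANA_OPTIMIZED, ANA_NONOPTIMIZED };

struct model_path {
        const char *name;
        enum ana_state ana;
        bool marginal;      /* set when an FPIN LI notification names this port */
        int numa_dist;      /* smaller == closer to the submitting CPU */
};

/* Lower score == more preferable; "marginal" outranks ANA state. */
static int path_score(const struct model_path *p)
{
        return (p->marginal ? 2 : 0) + (p->ana == ANA_OPTIMIZED ? 0 : 1);
}

static const struct model_path *select_path(const struct model_path *p, int n)
{
        const struct model_path *best = NULL;
        int i;

        for (i = 0; i < n; i++) {
                if (!best ||
                    path_score(&p[i]) < path_score(best) ||
                    (path_score(&p[i]) == path_score(best) &&
                     p[i].numa_dist < best->numa_dist))
                        best = &p[i];
        }
        return best;
}

int main(void)
{
        /* the 4-path test case above: 2 optimized + 2 non-optimized, all marginal */
        struct model_path paths[] = {
                { "opt_near",    ANA_OPTIMIZED,    true, 10 },
                { "opt_far",     ANA_OPTIMIZED,    true, 20 },
                { "nonopt_near", ANA_NONOPTIMIZED, true, 10 },
                { "nonopt_far",  ANA_NONOPTIMIZED, true, 20 },
        };

        printf("selected: %s\n", select_path(paths, 4)->name); /* opt_near */
        return 0;
}

With the all-marginal test case above this picks the optimized marginal path
on the closest node rather than the first non-optimized path.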
>> The queue-depth iopolicy doesn't change its path selection based on
>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
>> that there's currently no logic to handle marginal paths. We're
>> developing a patch to address that issue in queue-depth, but we need
>> to do more testing.
>>
> Again, by design.
> The whole point of the marginal path patchset is that I/O should
> be steered away from the marginal path, but the path itself should
> not completely shut off (otherwise we just could have declared the
> path as 'faulty' and be done with).
> Any I/O on 'marginal' paths should have higher latencies, and higher
> latencies should result in higher queue depths on these paths. So
> the queue-depth based IO scheduler should do the right thing
> automatically.
I don't understand this. The round-robin scheduler removes marginal paths, so
why shouldn't the queue-depth and NUMA schedulers do the same?
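To sketch what I mean (again a user-space model with hypothetical names, not a
patch against nvme_queue_depth_path()): do the queue-depth comparison over
non-marginal paths first, and only fall back to marginal paths when no usable
non-marginal path is left.

#include <stdio.h>
#include <stdbool.h>
#include <limits.h>

struct model_path {
        const char *name;
        bool usable;            /* connected and ANA state allows I/O */
        bool marginal;          /* FPIN LI reported against this port */
        unsigned int qdepth;    /* outstanding commands on this path */
};

static const struct model_path *pick(const struct model_path *p, int n,
                                     bool allow_marginal)
{
        const struct model_path *best = NULL;
        unsigned int best_depth = UINT_MAX;
        int i;

        for (i = 0; i < n; i++) {
                if (!p[i].usable)
                        continue;
                if (p[i].marginal && !allow_marginal)
                        continue;
                if (p[i].qdepth < best_depth) {
                        best = &p[i];
                        best_depth = p[i].qdepth;
                }
        }
        return best;
}

static const struct model_path *queue_depth_select(const struct model_path *p, int n)
{
        const struct model_path *best = pick(p, n, false);

        /* Only when every usable path is marginal do we consider them. */
        return best ? best : pick(p, n, true);
}

int main(void)
{
        struct model_path paths[] = {
                { "marginal_path", true, true,  3 },  /* shallow queue, but LI-marginal */
                { "healthy_path",  true, false, 12 }, /* deeper queue, no FPIN events   */
        };

        /* healthy_path wins even though marginal_path has fewer outstanding commands */
        printf("selected: %s\n", queue_depth_select(paths, 2)->name);
        return 0;
}

That keeps queue-depth behaviour for healthy paths and gives marginal paths
the same treatment round-robin already gives them.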
> Always assuming that marginal paths should have higher latencies.
> If they don't, they will happily be selected for I/O.
> But then again, if the marginal path does _not_ have higher
> latencies, why shouldn't we select it for I/O?
This may be true for FPIN Congestion Signal, but we are testing Link
Integrity. With FPIN LI I think we want to simply stop using the path.
LI has nothing to do with latency, so unless ALL paths are marginal the
IO scheduler should not be using any marginal path.

Do we need another state here? There is an ask to support FPIN CS, so maybe
using the term "marginal" to describe LI is wrong. Maybe we need something
like "marginal_li" and "marginal_cs" to describe the difference.
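Something along these lines, purely to illustrate the idea (the names are
hypothetical, not from this patch set): record why a path is marginal, so the
schedulers can drop LI-marginal paths from selection while leaving CS-marginal
paths eligible.

#include <stdbool.h>
#include <stdio.h>

enum marginal_reason {
        MARGINAL_NONE = 0,
        MARGINAL_LI,    /* FPIN Link Integrity: avoid the path if any alternative exists */
        MARGINAL_CS,    /* FPIN Congestion Signal: keep it eligible, let queue depth decide */
};

struct model_path {
        const char *name;
        enum marginal_reason marginal;
};

/* Only LI-marginal paths are dropped from selection, and only while a
 * non-marginal path is still available. */
static bool skip_for_selection(const struct model_path *p, bool have_healthy_path)
{
        return p->marginal == MARGINAL_LI && have_healthy_path;
}

int main(void)
{
        struct model_path li = { "li_path", MARGINAL_LI };
        struct model_path cs = { "cs_path", MARGINAL_CS };

        printf("%s skipped: %d\n", li.name, skip_for_selection(&li, true)); /* 1 */
        printf("%s skipped: %d\n", cs.name, skip_for_selection(&cs, true)); /* 0 */
        return 0;
}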
/John