[PATCH v7 0/6] nvme-fc: FPIN link integrity handling

Hannes Reinecke hare at suse.de
Tue Jul 8 23:21:02 PDT 2025


On 7/8/25 21:56, John Meneghini wrote:
> On 7/2/25 2:10 AM, Hannes Reinecke wrote:
>>> During path fail testing on the numa iopolicy, I found that I/O moves
>>> off of the marginal path after a first link integrity event is
>>> received, but if the non-marginal path the I/O is on is disconnected,
>>> the I/O is transferred onto a marginal path (in testing, sometimes
>>> I've seen it go to a "marginal optimized" path, and sometimes
>>> "marginal non-optimized").
>>>
>> That is by design.
>> 'marginal' paths are only evaluated for the 'optimized' path selection,
>> where it's obvious that 'marginal' paths should not be selected as
>> 'optimized'.
> 
> I think we might want to change this.  With the NUMA scheduler you can
> end up using the non-optimized marginal path.  This happens when
> we test with 4 paths (2 optimized and 2 non-optimized) and set all 4
> paths to marginal.  In this case the NUMA scheduler should simply
> choose the optimized marginal path on the closest numa node.  However,
> that's not what happens.  It consistently chooses the first
> non-optimized path.
Ah. So it seems that the NUMA scheduler needs to be fixed.
I'll have a look there.
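
Just to make the intent concrete, the selection I have in mind for the
NUMA case is a two-pass one: skip marginal paths first, and only fall
back to them when no non-marginal path is left. A rough user-space
model (the struct fields and helpers are made up for illustration, not
the actual nvme_ns/nvme_ns_head layout):

#include <stdbool.h>
#include <stddef.h>

struct model_path {
	bool live;		/* controller LIVE, ANA state usable */
	bool ana_optimized;	/* ANA optimized vs. non-optimized   */
	bool marginal;		/* set after an FPIN notification    */
	int  numa_distance;	/* distance from the submitting node */
};

static struct model_path *numa_select(struct model_path *p, size_t n,
				      bool allow_marginal)
{
	struct model_path *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!p[i].live || (p[i].marginal && !allow_marginal))
			continue;
		/* prefer ANA optimized, then the shortest NUMA distance */
		if (!best ||
		    (p[i].ana_optimized && !best->ana_optimized) ||
		    (p[i].ana_optimized == best->ana_optimized &&
		     p[i].numa_distance < best->numa_distance))
			best = &p[i];
	}
	return best;
}

/* First pass without marginal paths, fall back to them only if needed. */
static struct model_path *numa_find_path(struct model_path *p, size_t n)
{
	struct model_path *best = numa_select(p, n, false);

	return best ? best : numa_select(p, n, true);
}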

>> For 'non-optimized' the situation is less clear; is the 'non-optimized'
>> path preferable to 'marginal'? Or the other way round?
>> So once the 'optimized' path selection returns no paths, _any_ of the
>> remaining paths are eligible.
> 
> This is a good question for Broadcom.  I think, with all IO schedulers, 
> as long
> as there is a non-marginal path available, that path should be used.  So
> a non-marginal non-optimized path should take precedence over a marginal 
> optimized path.
> 
> In the case where all paths are marginal, I think the scheduler should 
> simply use the first optimized path on the closest numa node.

For the NUMA case, yes. But as I said above, it seems that the NUMA
scheduler needs to be fixed.
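
For the precedence question itself, the ordering you describe boils
down to a simple path classification; ties within a class would then
be broken by the active iopolicy (NUMA distance, round-robin position,
queue depth). Purely illustrative sketch, the names are made up:

#include <stdbool.h>

enum path_class {
	PATH_OPTIMIZED,			/* non-marginal, ANA optimized     */
	PATH_NONOPTIMIZED,		/* non-marginal, ANA non-optimized */
	PATH_MARGINAL_OPTIMIZED,	/* marginal, ANA optimized         */
	PATH_MARGINAL_NONOPTIMIZED,	/* marginal, ANA non-optimized     */
};

static enum path_class classify_path(bool ana_optimized, bool marginal)
{
	if (!marginal)
		return ana_optimized ? PATH_OPTIMIZED : PATH_NONOPTIMIZED;
	return ana_optimized ? PATH_MARGINAL_OPTIMIZED :
			       PATH_MARGINAL_NONOPTIMIZED;
}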

>>> The queue-depth iopolicy doesn't change its path selection based on
>>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
>>> that there's currently no logic to handle marginal paths.  We're
>>> developing a patch to address that issue in queue-depth, but we need
>>> to do more testing.
>>>
>> Again, by design.
>> The whole point of the marginal path patchset is that I/O should
>> be steered away from the marginal path, but the path itself should
>> not be completely shut off (otherwise we could just have declared
>> the path as 'faulty' and been done with it).
>> Any I/O on 'marginal' paths should have higher latencies, and higher
>> latencies should result in higher queue depths on these paths. So
>> the queue-depth based IO scheduler should do the right thing
>> automatically.
> 
> I don't understand this.  The round-robin scheduler removes marginal
> paths; why shouldn't the queue-depth and NUMA schedulers do the same?
> 
The NUMA scheduler should, that's correct.
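
If the queue-depth policy does turn out to need an explicit nudge, one
way to keep the "steer away but do not shut off" behaviour would be a
depth penalty for marginal paths instead of skipping them outright.
Rough user-space model only; the field names and the penalty value are
made up, this is not what the patches currently do:

#include <stdbool.h>
#include <stddef.h>

#define MARGINAL_DEPTH_PENALTY	8	/* arbitrary example value */

struct qd_path {
	bool live;
	bool marginal;
	unsigned int nr_outstanding;	/* requests currently in flight */
};

static struct qd_path *qd_select(struct qd_path *p, size_t n)
{
	struct qd_path *best = NULL;
	unsigned int best_depth = 0;

	for (size_t i = 0; i < n; i++) {
		/* marginal paths look 'busier' than they really are */
		unsigned int depth = p[i].nr_outstanding +
			(p[i].marginal ? MARGINAL_DEPTH_PENALTY : 0);

		if (!p[i].live)
			continue;
		if (!best || depth < best_depth) {
			best = &p[i];
			best_depth = depth;
		}
	}
	return best;
}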

>> Always assuming that marginal paths should have higher latencies.
>> If they don't, then they will be happily selected for I/O.
>> But then again, if the marginal path does _not_ have higher
>> latencies, why shouldn't we select it for I/O?
> 
> This may be true with FPIN Congestion Signal, but we are testing Link 
> Integrity.  With FPIN LI I think we want to simply stop using the path.
> LI has nothing to do with latency.  So unless ALL paths are marginal, 
> the IO scheduler should not be using any marginal path.
> 
For FPIN LI the paths should be marked as 'faulty', true.

> Do we need another state here?  There is an ask to support FPIN CS, so 
> maybe using the term "marginal" to describe LI is wrong.
> 
> Maybe we need something like "marginal_li" and "marginal_cs" to describe 
> the difference.
> 
I'm really not so sure. I really wonder how an FPIN LI event reflects
back on the actual I/O. Will the I/O be aborted with an error? Or does
the I/O
continue at a slower pace?
I would think the latter, and that's the design assumption for this
patchset. If it's the former and I/O is aborted with an error we are in
a situation similar to the one with a faulty cable, and we need
to come up with a different solution.
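
If LI and CS really do need to be told apart, a per-path reason flag
along the lines of your "marginal_li"/"marginal_cs" suggestion would
let the schedulers treat them differently: LI handled like a faulty
path and excluded, CS merely deprioritized. Again just an illustrative
sketch, all names made up:

#include <stdbool.h>

enum fpin_marginal_reason {
	FPIN_MARGINAL_NONE = 0,
	FPIN_MARGINAL_LI   = 1 << 0,	/* link integrity notification    */
	FPIN_MARGINAL_CS   = 1 << 1,	/* congestion signal notification */
};

/* LI behaves like 'faulty': the path is not used at all. */
static bool path_usable(unsigned int reason)
{
	return !(reason & FPIN_MARGINAL_LI);
}

/* CS only deprioritizes the path; it stays in the fallback set. */
static bool path_deprioritized(unsigned int reason)
{
	return (reason & FPIN_MARGINAL_CS) != 0;
}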

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


