[PATCH v7 0/6] nvme-fc: FPIN link integrity handling
Hannes Reinecke
hare at suse.de
Tue Jul 8 23:21:02 PDT 2025
On 7/8/25 21:56, John Meneghini wrote:
> On 7/2/25 2:10 AM, Hannes Reinecke wrote:
>>> During path fail testing on the numa iopolicy, I found that I/O moves
>>> off of the marginal path after a first link integrity event is
>>> received, but if the non-marginal path the I/O is on is disconnected,
>>> the I/O is transferred onto a marginal path (in testing, sometimes
>>> I've seen it go to a "marginal optimized" path, and sometimes
>>> "marginal non-optimized").
>>>
>> That is by design.
>> 'marginal' paths are only evaluated for the 'optimized' path selection,
>> where it's obvious that 'marginal' paths should not be selected as
>> 'optimized'.
>
> I think we might want to change this. With the NUMA scheduler you can
> end up using the non-optimized marginal path. This happens when
> we test with 4 paths (2 optimized and 2 non-optimized) and set all 4
> paths to marginal. In this case the NUMA scheduler should simply
> choose the optimized marginal path on the closest numa node.
> However, that's not what happens. It consistently chooses the
> first non-optimized path.
Ah. So it seems that the NUMA scheduler needs to be fixed.
I'll have a look there.
>> For 'non-optimized' the situation is less clear; is the 'non-optimized'
>> path preferable to 'marginal'? Or the other way round?
>> So once the 'optimized' path selection returns no paths, _any_ of the
>> remaining paths are eligible.
>
> This is a good question for Broadcom. I think, with all IO schedulers,
> as long as there is a non-marginal path available, that path should be
> used. So a non-marginal non-optimized path should take precedence over
> a marginal optimized path.
>
> In the case where all paths are marginal, I think the scheduler should
> simply use the first optimized path on the closest numa node.
For the NUMA case, yes. But as I said above, it seems that the NUMA
scheduler needs to be fixed.
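To make the intended ordering concrete, here is a stand-alone model (not
the actual multipath code; the 'marginal' flag and the ranking are
assumptions based on this thread): healthy optimized first, then healthy
non-optimized, then marginal optimized, then marginal non-optimized,
with ties broken by NUMA distance.

/*
 * Stand-alone model, not kernel code: rank paths by (marginal, ANA state)
 * and break ties by NUMA distance.
 */
#include <stdbool.h>
#include <stdio.h>

enum ana_state { ANA_OPTIMIZED, ANA_NONOPTIMIZED };

struct path {
        const char *name;
        enum ana_state ana;
        bool marginal;
        int distance;           /* NUMA distance to the submitting node */
};

/* 0 = healthy optimized, 1 = healthy non-optimized,
 * 2 = marginal optimized, 3 = marginal non-optimized */
static int path_class(const struct path *p)
{
        return (p->marginal ? 2 : 0) + (p->ana == ANA_OPTIMIZED ? 0 : 1);
}

static const struct path *numa_select(const struct path *p, int n)
{
        const struct path *best = NULL;

        for (int i = 0; i < n; i++) {
                if (!best || path_class(&p[i]) < path_class(best) ||
                    (path_class(&p[i]) == path_class(best) &&
                     p[i].distance < best->distance))
                        best = &p[i];
        }
        return best;
}

int main(void)
{
        /* one healthy path left: it wins even though it is non-optimized */
        const struct path mixed[] = {
                { "opt-marginal",    ANA_OPTIMIZED,    true,  10 },
                { "nonopt-healthy",  ANA_NONOPTIMIZED, false, 20 },
                { "nonopt-marginal", ANA_NONOPTIMIZED, true,  10 },
        };
        /* all paths marginal: fall back to the optimized marginal path
         * on the closest node */
        const struct path all_marginal[] = {
                { "opt-marginal-near",    ANA_OPTIMIZED,    true, 10 },
                { "opt-marginal-far",     ANA_OPTIMIZED,    true, 20 },
                { "nonopt-marginal-near", ANA_NONOPTIMIZED, true, 10 },
        };

        printf("mixed:        %s\n", numa_select(mixed, 3)->name);
        printf("all marginal: %s\n", numa_select(all_marginal, 3)->name);
        return 0;
}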
>>> The queue-depth iopolicy doesn't change its path selection based on
>>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
>>> that there's currently no logic to handle marginal paths. We're
>>> developing a patch to address that issue in queue-depth, but we need
>>> to do more testing.
>>>
>> Again, by design.
>> The whole point of the marginal path patchset is that I/O should
>> be steered away from the marginal path, but the path itself should
>> not be completely shut off (otherwise we could just have declared the
>> path as 'faulty' and been done with it).
>> Any I/O on 'marginal' paths should have higher latencies, and higher
>> latencies should result in higher queue depths on these paths. So
>> the queue-depth based IO scheduler should do the right thing
>> automatically.
>
> I don't understand this. The Round-robin scheduler removes marginal
> paths, why shouldn't the queue-depth and numa scheduler do the same?
>
The NUMA scheduler should, that's correct.
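For queue-depth the working assumption is that no special handling is
needed at all: a slower (marginal) path retires requests more slowly,
keeps a higher depth, and so stops being selected as the least-loaded
path. A stand-alone sketch of that feedback loop (not kernel code; the
per-round completion "speeds" are made up):

/*
 * Stand-alone model, not kernel code: a slower path completes fewer
 * requests per round, so its outstanding depth stays higher and it is
 * picked less often by a "least inflight" selection.
 */
#include <stdio.h>

struct path {
        const char *name;
        int inflight;   /* outstanding requests on this path */
        int retire;     /* requests it can complete per round (its "speed") */
        int submitted;  /* total I/Os routed to it, for the summary */
};

static struct path *qd_select(struct path *p, int n)
{
        struct path *best = &p[0];

        for (int i = 1; i < n; i++)
                if (p[i].inflight < best->inflight)
                        best = &p[i];
        return best;
}

int main(void)
{
        struct path paths[] = {
                { "healthy",  0, 3, 0 },        /* fast path */
                { "marginal", 0, 1, 0 },        /* slow path */
        };

        for (int round = 0; round < 5; round++) {
                /* submit a burst of four I/Os, always to the least-loaded path */
                for (int i = 0; i < 4; i++) {
                        struct path *p = qd_select(paths, 2);

                        p->inflight++;
                        p->submitted++;
                }
                /* each path retires up to 'retire' requests per round */
                for (int i = 0; i < 2; i++)
                        paths[i].inflight -= paths[i].inflight < paths[i].retire ?
                                             paths[i].inflight : paths[i].retire;
        }
        printf("healthy: %d I/Os, marginal: %d I/Os\n",
               paths[0].submitted, paths[1].submitted);
        return 0;
}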
>> Always assuming that marginal paths should have higher latencies.
>> If they haven't, then they will happily be selected for I/O.
>> But then again, if the marginal path does _not_ have higher
>> latencies, why shouldn't we select it for I/O?
>
> This may be true with FPIN Congestion Signal, but we are testing Link
> Integrity. With FPIN LI I think we want to simply stop using the path.
> LI has nothing to do with latency. So unless ALL paths are marginal,
> the IO scheduler should not be using any marginal path.
>
For FPIN LI the paths should be marked as 'faulty', true.
> Do we need another state here? There is an ask to support FPIN CS, so
> maybe using the term "marginal" to describe LI is wrong.
>
> Maybe we need something like "marginal_li" and "marginal_cs" to describe
> the difference.
>
I'm really not so sure. I wonder how a FPIN LI event reflects back on
the actual I/O. Will the I/O be aborted with an error? Or does the I/O
continue at a slower pace?
I would think the latter, and that's the design assumption for this
patchset. If it's the former and I/O is aborted with an error, we are in
a similar situation to the one we have with a faulty cable, and we need
to come up with a different solution.
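Should it turn out that LI and CS really do need different handling,
your 'marginal_li'/'marginal_cs' idea could be expressed as a per-path
reason rather than a single flag. Purely illustrative, none of these
names exist in the patchset:

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustration only: keep the reason a path is degraded, so that LI can
 * mean "last resort only" while CS keeps the current 'marginal' handling.
 */
enum path_degraded {
        PATH_HEALTHY = 0,
        PATH_MARGINAL_CS,       /* FPIN congestion: usable, expect higher latency */
        PATH_MARGINAL_LI,       /* FPIN link integrity: avoid unless nothing else is left */
};

static bool path_usable(enum path_degraded d, bool all_paths_degraded)
{
        switch (d) {
        case PATH_HEALTHY:
        case PATH_MARGINAL_CS:  /* deprioritized elsewhere, but not excluded */
                return true;
        case PATH_MARGINAL_LI:
                return all_paths_degraded;
        }
        return false;
}

int main(void)
{
        printf("LI path usable while healthy paths remain: %d\n",
               path_usable(PATH_MARGINAL_LI, false));
        printf("LI path usable when all paths are degraded: %d\n",
               path_usable(PATH_MARGINAL_LI, true));
        return 0;
}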
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich