[PATCH v7 0/6] nvme-fc: FPIN link integrity handling
Bryan Gurney
bgurney at redhat.com
Wed Jul 9 06:42:38 PDT 2025
On Wed, Jul 9, 2025 at 2:21 AM Hannes Reinecke <hare at suse.de> wrote:
>
> On 7/8/25 21:56, John Meneghini wrote:
> > On 7/2/25 2:10 AM, Hannes Reinecke wrote:
> >>> During path fail testing on the numa iopolicy, I found that I/O moves
> >>> off of the marginal path after a first link integrity event is
> >>> received, but if the non-marginal path the I/O is on is disconnected,
> >>> the I/O is transferred onto a marginal path (in testing, sometimes
> >>> I've seen it go to a "marginal optimized" path, and sometimes
> >>> "marginal non-optimized").
> >>>
> >> That is by design.
> >> 'marginal' paths are only evaluated for the 'optimized' path selection,
> >> where it's obvious that 'marginal' paths should not be selected as
> >> 'optimized'.
> >
> > I think we might want to change this. With the NUMA scheduler you can
> > end up using the non-optimized marginal path. This happens when
> > we test with 4 paths (2 optimized and 2 non-optimized) and set all 4
> > paths to marginal. In this case the NUMA scheduler should simply
> > choose the optimized marginal path on the closest numa node.
> > However, that's not what happens. It consistently chooses the first
> > non-optimized path.
> >
> Ah. So it seems that the NUMA scheduler needs to be fixed.
> I'll have a look there.
>
> >> For 'non-optimized' the situation is less clear; is the 'non-optimized'
> >> path preferable to 'marginal'? Or the other way round?
> >> So once the 'optimized' path selection returns no paths, _any_ of the
> >> remaining paths are eligible.
> >
> > This is a good question for Broadcom. I think, with all IO schedulers,
> > as long
> > as there is a non-marginal path available, that path should be used. So
> > a non-marginal non-optimized path should take precedence over a marginal
> > optimized path.
> >
> > In the case where all paths are marginal, I think the scheduler should
> > simply use the first optimized path on the closest numa node.
>
> For the NUMA case, yes. But as I said above, it seems that the NUMA
> scheduler needs to be fixed.
>
> >>> The queue-depth iopolicy doesn't change its path selection based on
> >>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
> >>> that there's currently no logic to handle marginal paths. We're
> >>> developing a patch to address that issue in queue-depth, but we need
> >>> to do more testing.
> >>>
> >> Again, by design.
> >> The whole point of the marginal path patchset is that I/O should
> >> be steered away from the marginal path, but the path itself should
> >> not be completely shut off (otherwise we could just have declared the
> >> path as 'faulty' and been done with it).
> >> Any I/O on 'marginal' paths should have higher latencies, and higher
> >> latencies should result in higher queue depths on these paths. So
> >> the queue-depth based IO scheduler should do the right thing
> >> automatically.
> >
> > I don't understand this. The Round-robin scheduler removes marginal
> > paths; why shouldn't the queue-depth and NUMA schedulers do the same?
> >
> The NUMA scheduler should, that's correct.
>
> >> Always assuming that marginal paths should have higher latencies.
> >> If they don't, then they will happily be selected for I/O.
> >> But then again, if the marginal path does _not_ have higher
> >> latencies, why shouldn't we select it for I/O?
> >
> > This may be true with FPIN Congestion Signal, but we are testing Link
> > Integrity. With FPIN LI I think we want to simply stop using the path.
> > LI has nothing to do with latency. So unless ALL paths are marginal,
> > the IO scheduler should not be using any marginal path.
> >
> For FPIN LI the paths should be marked as 'faulty', true.
>
> > Do we need another state here? There is an ask to support FPIN CS, so
> > maybe using the term "marginal" to describe LI is wrong.
> >
> > Maybe we need something like "marginal_li" and "marginal_cs" to describe
> > the difference.
> >
> Really not so sure. I really wonder how an FPIN LI event reflects back
> on the actual I/O. Will the I/O be aborted with an error? Or does the
> I/O continue at a slower pace?
> I would think the latter, and that's the design assumption for this
> patchset. If it's the former and I/O is aborted with an error, we are
> in a situation similar to the one we have with a faulty cable, and we
> need to come up with a different solution.
>
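To make the precedence John and Hannes are discussing above concrete
(non-marginal before marginal, optimized before non-optimized within
each group, and marginal paths only as a last resort), here is a
minimal user-space model of that selection order. The struct, the
ranking helper, and all of the names are purely illustrative; none of
this is taken from the nvme-multipath code or from this patchset.

/*
 * Toy model of the proposed precedence:
 *   1. non-marginal optimized
 *   2. non-marginal non-optimized
 *   3. marginal optimized
 *   4. marginal non-optimized
 * Unusable (faulty/disconnected) paths are never selected.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct model_path {
        const char *name;
        bool optimized;
        bool marginal;
        bool usable;            /* live and not faulty */
};

/* Lower rank is better. */
static int path_rank(const struct model_path *p)
{
        return (p->marginal ? 2 : 0) + (p->optimized ? 0 : 1);
}

static const struct model_path *select_path(const struct model_path *paths,
                                            size_t n)
{
        const struct model_path *best = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
                if (!paths[i].usable)
                        continue;
                if (!best || path_rank(&paths[i]) < path_rank(best))
                        best = &paths[i];
        }
        return best;
}

int main(void)
{
        /* The test case above: 2 optimized + 2 non-optimized, all marginal. */
        struct model_path paths[] = {
                { "optimized-a",     true,  true, true },
                { "optimized-b",     true,  true, true },
                { "non-optimized-a", false, true, true },
                { "non-optimized-b", false, true, true },
        };

        const struct model_path *p = select_path(paths, 4);

        printf("selected: %s\n", p ? p->name : "none");
        return 0;
}

With all four paths marked marginal, this picks the first marginal
optimized path (the model ignores NUMA distance, which the real NUMA
iopolicy would also take into account), which is the behavior John
describes expecting.
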
During my testing, I was watching the logs on the test host as I was
about to run the command on the switch to generate the FPIN LI event.
I didn't see any I/O errors, and the I/O continued at the normally
expected throughput and latency. But "if this had been an actual
emergency...", as the saying goes, there would probably be some kind
of disruption caused by the condition behind the event itself (e.g.,
"loss of sync", "loss of signal", "link failure").
There was a Storage Developer Conference 21 presentation slide deck on
the FPIN LI events that's hosted on the SNIA website [1]; slide 4
shows the problem statements addressed by the notifications.
In my previous career as a system administrator, I remember seeing
strange performance slowdowns on high-volume database servers, and on
searching through the logs, I might find an event from the database
engine about an I/O operation taking over 30 seconds to complete.
Meanwhile, the application using the database was backlogged because
its queries were taking longer, and the root cause ended up being a
faulty SFP. After replacing that, we could get the application running
again to process its replication and workload backlogs. The link integrity
events could help alert on these link problems before they turn into
application disruptions.
Thanks,
Bryan
[1] https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-Johnson-Introducing-Fabric-Notifications-From-Awareness-to-Action.pdf
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> hare at suse.de +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
>