[PATCH v7 0/6] nvme-fc: FPIN link integrity handling
Hannes Reinecke
hare at suse.de
Wed Jul 9 10:44:56 PDT 2025
On 7/9/25 15:42, Bryan Gurney wrote:
> On Wed, Jul 9, 2025 at 2:21 AM Hannes Reinecke <hare at suse.de> wrote:
>>
[ .. ]
>>> This may be true with FPIN Congestion Signal, but we are testing Link
>>> Integrity. With FPIN LI I think we want to simply stop using the path.
>>> LI has nothing to do with latency. So unless ALL paths are marginal,
>>> the IO scheduler should not be using any marginal path.
>>>
>> For FPIN LI the paths should be marked as 'faulty', true.
>>
>>> Do we need another state here? There is an ask to support FPIN CS, so
>>> maybe using the term "marginal" to describe LI is wrong.
>>>
>>> Maybe we need something like "marginal_li" and "marginal_cs" to describe
>>> the difference.
>>>
>> Really not so sure. I really wonder how an FPIN LI event reflects back on
>> the actual I/O. Will the I/O be aborted with an error? Or does the I/O
>> continue at a slower pace?
>> I would think the latter, and that's the design assumption for this
>> patchset. If it's the former and I/O is aborted with an error, we are in
>> a situation similar to the one with a faulty cable, and we need
>> to come up with a different solution.
>>
>
> During my testing, I was watching the logs on the test host as I was
> about to run the command on the switch to generate the FPIN LI event.
> I didn't see any I/O errors, and the I/O continued at the normally
> expected throughput and latency. But "if this had been an actual
> emergency...", as the saying goes, there would probably be some kind of
> disruption that the event itself would be expected to cause (e.g.:
> "loss sync", "loss signal", "link failure").
>
> There was a Storage Developer Conference 21 presentation slide deck on
> the FPIN LI events that's hosted on the SNIA website [1]; slide 4
> shows the problem statements addressed by the notifications.
>
> In my previous career as a system administrator, I remember seeing
> strange performance slowdowns on high-volume database servers, and on
> searching through the logs, I might find an event from the database
> engine about an I/O operation taking over 30 seconds to complete.
> Meanwhile, the application using the database was backlogged due to
> its queries taking longer, for what ended up being a faulty SFP.
> After replacing that, we could get the application running again to
> process its replication and workload backlogs. The link integrity
> events could help alert on these link problems before they turn into
> application disruptions.
>
But that's precisely it, isn't it?
If it's a straight error, the path/controller gets reset anyway, and
there's really nothing more for us to do.
If it's an FPIN LI _without_ any performance impact, why shouldn't
we continue to use that path? Would there be any impact if we do?
And if it's an FPIN LI with _any_ sort of performance impact
(or a performance impact which might happen eventually), the
current approach of steering I/O away should be fine.
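Roughly what I have in mind, as a plain-C sketch (illustration only, not
the actual patchset code; the state names and the selection helper are
made up here):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Illustration only: per-path state as discussed in this thread. */
    enum path_state {
            PATH_LIVE,      /* healthy, preferred */
            PATH_MARGINAL,  /* FPIN LI/CS received, avoid if possible */
            PATH_FAULTY,    /* hard error, never use */
    };

    struct path {
            const char *name;
            enum path_state state;
    };

    /*
     * Prefer a live path; fall back to a marginal one only if no live
     * path is left, and never pick a faulty one.
     */
    static struct path *select_path(struct path *paths, size_t nr)
    {
            struct path *fallback = NULL;

            for (size_t i = 0; i < nr; i++) {
                    if (paths[i].state == PATH_LIVE)
                            return &paths[i];
                    if (paths[i].state == PATH_MARGINAL && !fallback)
                            fallback = &paths[i];
            }
            return fallback;
    }

    int main(void)
    {
            struct path paths[] = {
                    { "rport-0", PATH_MARGINAL },   /* FPIN LI seen here */
                    { "rport-1", PATH_LIVE },
                    { "rport-2", PATH_FAULTY },
            };
            struct path *p = select_path(paths,
                            sizeof(paths) / sizeof(paths[0]));

            printf("selected: %s\n", p ? p->name : "none");
            return 0;
    }

So as long as at least one healthy path is left, the marginal one is
simply skipped; only when every remaining path is marginal do we keep
using them.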
Am I missing something?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich