[PATCH v7 0/6] nvme-fc: FPIN link integrity handling

Hannes Reinecke hare at suse.de
Tue Jul 1 23:10:20 PDT 2025


On 7/1/25 22:32, Bryan Gurney wrote:
> On Tue, Jun 24, 2025 at 4:20 PM Bryan Gurney <bgurney at redhat.com> wrote:
>>
>> FPIN LI (link integrity) messages are received when the attached
>> fabric detects hardware errors. In response to these messages, I/O
>> should be directed away from the affected ports, which should only
>> be used if the 'optimized' paths are unavailable.
>> Upon port reset the paths should be put back in service, as the
>> affected hardware might have been replaced.
>> This patch adds a new controller flag 'NVME_CTRL_MARGINAL'
>> which will be checked during multipath path selection, causing the
>> path to be skipped when checking for 'optimized' paths. If no
>> optimized paths are available the 'marginal' paths are considered
>> for path selection alongside the 'non-optimized' paths.
>> It also introduces a new nvme-fc callback 'nvme_fc_fpin_rcv()' to
>> evaluate the FPIN LI TLV payload and set the 'marginal' state on
>> all affected rports.
>>
>> The testing for this patch set was performed by Bryan Gurney, using the
>> process outlined by John Meneghini's presentation at LSFMM 2024, where
>> the fibre channel switch sends an FPIN notification on a specific switch
>> port, and the following is checked on the initiator:
>>
>> 1. The controllers corresponding to the paths on the port that has
>> received the notification are showing a set NVME_CTRL_MARGINAL flag.
>>
>>     \
>>      +- nvme4 fc traddr=c,host_traddr=e live optimized
>>      +- nvme5 fc traddr=8,host_traddr=e live non-optimized
>>      +- nvme8 fc traddr=e,host_traddr=f marginal optimized
>>      +- nvme9 fc traddr=a,host_traddr=f marginal non-optimized
>>
>> 2. The I/O statistics of the test namespace show no I/O activity on the
>> controllers with NVME_CTRL_MARGINAL set.
>>
>>     Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>>     nvme4c4n1         0.00         0.00         0.00         0.00
>>     nvme4c5n1     25001.00         0.00        97.66         0.00
>>     nvme4c9n1     25000.00         0.00        97.66         0.00
>>     nvme4n1       50011.00         0.00       195.36         0.00
>>
>>
>>     Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>>     nvme4c4n1         0.00         0.00         0.00         0.00
>>     nvme4c5n1     48360.00         0.00       188.91         0.00
>>     nvme4c9n1      1642.00         0.00         6.41         0.00
>>     nvme4n1       49981.00         0.00       195.24         0.00
>>
>>
>>     Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>>     nvme4c4n1         0.00         0.00         0.00         0.00
>>     nvme4c5n1     50001.00         0.00       195.32         0.00
>>     nvme4c9n1         0.00         0.00         0.00         0.00
>>     nvme4n1       50016.00         0.00       195.38         0.00
>>
>> Link: https://people.redhat.com/jmeneghi/LSFMM_2024/LSFMM_2024_NVMe_Cancel_and_FPIN.pdf
>>
>> More rigorous testing was also performed to ensure proper path migration
>> on each of the eight different FPIN link integrity events, particularly
>> in the scenario where all paths are marginal and only non-optimized
>> paths are available.  With the round-robin iopolicy, when all paths on
>> the host show as marginal, I/O continues on the optimized path that was
>> most recently non-marginal.  From this point, if both of the optimized
>> paths are down, I/O properly continues on the remaining paths.
>>
>> Changes to the original submission:
>> - Changed flag name to 'marginal'
>> - Do not block marginal path; influence path selection instead
>>    to de-prioritize marginal paths
>>
>> Changes to v2:
>> - Split off driver-specific modifications
>> - Introduce 'union fc_tlv_desc' to avoid casts
>>
>> Changes to v3:
>> - Include reviews from Justin Tee
>> - Split marginal path handling patch
>>
>> Changes to v4:
>> - Change 'u8' to '__u8' on fc_tlv_desc to fix a failure to build
>> - Print 'marginal' instead of 'live' in the state of controllers
>>    when they are marginal
>>
>> Changes to v5:
>> - Minor spelling corrections to patch descriptions
>>
>> Changes to v6:
>> - No code changes; added note about additional testing
>>
>> Hannes Reinecke (5):
>>    fc_els: use 'union fc_tlv_desc'
>>    nvme-fc: marginal path handling
>>    nvme-fc: nvme_fc_fpin_rcv() callback
>>    lpfc: enable FPIN notification for NVMe
>>    qla2xxx: enable FPIN notification for NVMe
>>
>> Bryan Gurney (1):
>>    nvme: sysfs: emit the marginal path state in show_state()
>>
>>   drivers/nvme/host/core.c         |   1 +
>>   drivers/nvme/host/fc.c           |  99 +++++++++++++++++++
>>   drivers/nvme/host/multipath.c    |  17 ++--
>>   drivers/nvme/host/nvme.h         |   6 ++
>>   drivers/nvme/host/sysfs.c        |   4 +-
>>   drivers/scsi/lpfc/lpfc_els.c     |  84 ++++++++--------
>>   drivers/scsi/qla2xxx/qla_isr.c   |   3 +
>>   drivers/scsi/scsi_transport_fc.c |  27 +++--
>>   include/linux/nvme-fc-driver.h   |   3 +
>>   include/uapi/scsi/fc/fc_els.h    | 165 +++++++++++++++++--------------
>>   10 files changed, 269 insertions(+), 140 deletions(-)
>>
>> --
>> 2.49.0
>>
> 
> 
> We're going to be working on follow-up patches to address some things
> that I found in additional testing:
> 
> During path fail testing on the numa iopolicy, I found that I/O moves
> off of the marginal path after a first link integrity event is
> received, but if the non-marginal path the I/O is on is disconnected,
> the I/O is transferred onto a marginal path (in testing, sometimes
> I've seen it go to a "marginal optimized" path, and sometimes
> "marginal non-optimized").
> 
That is by design.
The 'marginal' flag is only evaluated during the 'optimized' path
selection, where it is obvious that a 'marginal' path should not be
picked as an 'optimized' one.
For 'non-optimized' the situation is less clear: is a 'non-optimized'
path preferable to a 'marginal' one, or the other way round?
So once the 'optimized' path selection returns no paths, _any_ of the
remaining paths are eligible.

> The queue-depth iopolicy doesn't change its path selection based on
> the marginal flag, but looking at nvme_queue_depth_path(), I can see
> that there's currently no logic to handle marginal paths.  We're
> developing a patch to address that issue in queue-depth, but we need
> to do more testing.
> 
Again, by design.
The whole point of the marginal path patchset is that I/O should
be steered away from marginal paths, but the paths themselves should
not be shut off completely (otherwise we could just have declared the
path 'faulty' and been done with it).
Any I/O on 'marginal' paths should see higher latencies, and higher
latencies should result in higher queue depths on those paths. So
the queue-depth based I/O scheduler should do the right thing
automatically.
That always assumes marginal paths do have higher latencies;
if they don't, they will happily be selected for I/O.
But then again, if a marginal path does _not_ have higher
latencies, why shouldn't we select it for I/O?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


