[PATCH RFC] nvme-fc: FPIN link integrity handling

Sagi Grimberg sagi at grimberg.me
Thu Mar 7 04:01:21 PST 2024



On 07/03/2024 13:29, Hannes Reinecke wrote:
> On 3/7/24 11:10, Sagi Grimberg wrote:
>>
>>
>> On 19/02/2024 10:59, hare at kernel.org wrote:
>>> From: Hannes Reinecke <hare at suse.de>
>>>
>>> FPIN LI (link integrity) messages are received when the attached
>>> fabric detects hardware errors. In response to these messages the
>>> affected ports should not be used for I/O, and only put back into
>>> service once the ports had been reset as then the hardware might
>>> have been replaced.
>>
>> Does this mean it cannot service any type of communication over
>> the wire?
>>
> It means that the service is impacted, and communication cannot be 
> guaranteed (CRC errors, packet loss, you name it).
> So the link should be taken out of service until it's been (manually)
> replaced.

OK, that's what I assumed.

>
>>> This patch adds a new controller flag 'NVME_CTRL_TRANSPORT_BLOCKED'
>>> which will be checked during multipath path selection, causing the
>>> path to be skipped.
>>
>> While this looks sensible to me, it also looks like this introduces a 
>> ctrl state
>> outside of ctrl->state... Wouldn't it make sense to move the 
>> controller to
>> NVME_CTRL_DEAD ? or is it not a terminal state?
>>
> Actually, I was trying to model it alongside the 
> 'devloss_tmo'/'fast_io_fail' mechanism we have in SCSI.
> Technically the controller is still present, it's just that we shouldn't
> send I/O to it.

Sounds like a dead controller to me.

> And I'd rather not disconnect here as we're trying to
> do an autoconnect on FC, so manually disconnect would interfere with
> that and we probably end in a death spiral doing disconnect/reconnect.

I suggested just transitioning the state to DEAD... Not sure how keep-alives
behave though...

>
> We could 'elevate' it to a new controller state, but wasn't sure how big
> an appetite there is. And we already have flags like 'stopped' which
> seem to fall into the same category.

stopped is different because it is not used to determine if it is capable
for IO (admin or io queues). Hence it is ok to be a flag.

>
> So I'd rather not touch the state machine.

I'm just trying to understand if it falls into one of the existing 
states that
we have today. Because it sounds awfully like DEAD to me.



More information about the Linux-nvme mailing list