[PATCH RFC] nvme-fc: FPIN link integrity handling

Hannes Reinecke hare at suse.de
Thu Mar 7 04:13:08 PST 2024


On 3/7/24 13:01, Sagi Grimberg wrote:
> 
> 
> On 07/03/2024 13:29, Hannes Reinecke wrote:
>> On 3/7/24 11:10, Sagi Grimberg wrote:
>>>
>>>
>>> On 19/02/2024 10:59, hare at kernel.org wrote:
>>>> From: Hannes Reinecke <hare at suse.de>
>>>>
>>>> FPIN LI (link integrity) messages are received when the attached
>>>> fabric detects hardware errors. In response to these messages the
>>>> affected ports should not be used for I/O, and only put back into
>>>> service once the ports had been reset as then the hardware might
>>>> have been replaced.
>>>
>>> Does this mean it cannot service any type of communication over
>>> the wire?
>>>
>> It means that the service is impacted, and communication cannot be 
>> guaranteed (CRC errors, packet loss, you name it).
>> So the link should be taken out of service until it's been (manually)
>> replaced.
> 
> OK, that's what I assumed.
> 
>>
>>>> This patch adds a new controller flag 'NVME_CTRL_TRANSPORT_BLOCKED'
>>>> which will be checked during multipath path selection, causing the
>>>> path to be skipped.
>>>
>>> While this looks sensible to me, it also looks like this introduces a 
>>> ctrl state
>>> outside of ctrl->state... Wouldn't it make sense to move the 
>>> controller to
>>> NVME_CTRL_DEAD ? or is it not a terminal state?
>>>
>> Actually, I was trying to model it alongside the 
>> 'devloss_tmo'/'fast_io_fail' mechanism we have in SCSI.
>> Technically the controller is still present, it's just that we shouldn't
>> send I/O to it.
> 
> Sounds like a dead controller to me.
> 
Sort of, yes.

>> And I'd rather not disconnect here as we're trying to
>> do an autoconnect on FC, so manually disconnect would interfere with
>> that and we probably end in a death spiral doing disconnect/reconnect.
> 
> I suggested just transitioning the state to DEAD... Not sure how 
> keep-alives behave though...
> 
Hmm. The state machine has the transition LIVE->DELETING->DEAD,
ie a dead controller is on the way out, with all resources being
reclaimed.

A direct transition would pretty much violate that.
If we were going that way I'd prefer to have another state
('IMPACTED' ? 'LIVE_NOIO' ?) with the transitions
LIVE->IMPACTED->DELETING->DEAD

>>
>> We could 'elevate' it to a new controller state, but wasn't sure how big
>> an appetite there is. And we already have flags like 'stopped' which
>> seem to fall into the same category.
> 
> stopped is different because it is not used to determine if it is capable
> for IO (admin or io queues). Hence it is ok to be a flag.
> 
Okay.

So yeah, we could introduce a new state, but I guess a direct transition
to 'DEAD' is not really a good idea.

Cheers,

Hannes




More information about the Linux-nvme mailing list