[PATCH RFC] nvme-fc: FPIN link integrity handling
Hannes Reinecke
hare at suse.de
Thu Mar 7 04:13:08 PST 2024
On 3/7/24 13:01, Sagi Grimberg wrote:
>
>
> On 07/03/2024 13:29, Hannes Reinecke wrote:
>> On 3/7/24 11:10, Sagi Grimberg wrote:
>>>
>>>
>>> On 19/02/2024 10:59, hare at kernel.org wrote:
>>>> From: Hannes Reinecke <hare at suse.de>
>>>>
>>>> FPIN LI (link integrity) messages are received when the attached
>>>> fabric detects hardware errors. In response to these messages the
>>>> affected ports should not be used for I/O, and should only be put back
>>>> into service once the ports have been reset, as by then the hardware
>>>> might have been replaced.
>>>
>>> Does this mean it cannot service any type of communication over
>>> the wire?
>>>
>> It means that the service is impacted, and communication cannot be
>> guaranteed (CRC errors, packet loss, you name it).
>> So the link should be taken out of service until it's been (manually)
>> replaced.
>
> OK, that's what I assumed.
>
>>
>>>> This patch adds a new controller flag 'NVME_CTRL_TRANSPORT_BLOCKED'
>>>> which will be checked during multipath path selection, causing the
>>>> path to be skipped.
>>>
>>> While this looks sensible to me, it also looks like this introduces a
>>> ctrl state outside of ctrl->state... Wouldn't it make sense to move the
>>> controller to NVME_CTRL_DEAD? Or is it not a terminal state?
>>>
>> Actually, I was trying to model it alongside the
>> 'devloss_tmo'/'fast_io_fail' mechanism we have in SCSI.
>> Technically the controller is still present, it's just that we shouldn't
>> send I/O to it.
>
> Sounds like a dead controller to me.
>
Sort of, yes.
>> And I'd rather not disconnect here as we're trying to
>> do an autoconnect on FC, so a manual disconnect would interfere with
>> that and we would probably end up in a death spiral of
>> disconnects and reconnects.
>
> I suggested just transitioning the state to DEAD... Not sure how
> keep-alives behave though...
>
Hmm. The state machine has the transition LIVE->DELETING->DEAD,
i.e. a dead controller is on the way out, with all resources being
reclaimed.
A direct transition from LIVE to DEAD would pretty much violate that.
If we were going that way I'd prefer to have another state
('IMPACTED'? 'LIVE_NOIO'?) with the transitions
LIVE->IMPACTED->DELETING->DEAD.
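
To make that ordering concrete, here is a minimal standalone sketch of
such a transition check. It is not the kernel's state machine code: the
enum, the 'IMPACTED' state and the helper name are all hypothetical, and
allowing IMPACTED->LIVE after a port reset is an assumption taken from
the patch description, not something settled in this thread.

```c
#include <stdbool.h>

/* Hypothetical subset of controller states; IMPACTED is the proposed one. */
enum sketch_ctrl_state {
	SKETCH_LIVE,
	SKETCH_IMPACTED,	/* controller present, but no I/O should be sent */
	SKETCH_DELETING,
	SKETCH_DEAD,
};

/* Return true if moving from old_state to new_state is permitted. */
static bool sketch_transition_ok(enum sketch_ctrl_state old_state,
				 enum sketch_ctrl_state new_state)
{
	switch (new_state) {
	case SKETCH_IMPACTED:
		return old_state == SKETCH_LIVE;
	case SKETCH_LIVE:
		/* Assumption: a port reset puts the path back into service. */
		return old_state == SKETCH_IMPACTED;
	case SKETCH_DELETING:
		return old_state == SKETCH_LIVE ||
		       old_state == SKETCH_IMPACTED;
	case SKETCH_DEAD:
		/* DEAD is only reachable via DELETING, never directly. */
		return old_state == SKETCH_DELETING;
	}
	return false;
}
```

With this table a LIVE->DEAD or IMPACTED->DEAD request is rejected, so
resource teardown always goes through DELETING first.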
>>
>> We could 'elevate' it to a new controller state, but wasn't sure how big
>> an appetite there is. And we already have flags like 'stopped' which
>> seem to fall into the same category.
>
> 'stopped' is different because it is not used to determine whether the
> controller is capable of I/O (admin or io queues). Hence it is ok for it
> to be a flag.
>
Okay.
So yeah, we could introduce a new state, but I guess a direct transition
to 'DEAD' is not really a good idea.
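
Whichever representation is chosen, the effect on multipath path
selection is the same: a path whose port has seen an FPIN LI is skipped.
A rough standalone sketch, with a hypothetical path struct and helper
names that are not the kernel's multipath code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified path descriptor for illustration only. */
struct sketch_path {
	bool live;		/* controller is in its normal operational state */
	bool transport_blocked;	/* set on FPIN LI, cleared after a port reset */
};

/* A path may carry I/O only if it is live and not blocked. */
static bool sketch_path_usable(const struct sketch_path *p)
{
	return p->live && !p->transport_blocked;
}

/* Return the index of the first usable path, or -1 if there is none. */
static int sketch_select_path(const struct sketch_path *paths, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (sketch_path_usable(&paths[i]))
			return (int)i;
	}
	return -1;
}
```

With a dedicated state instead of a flag, sketch_path_usable() would
collapse into a single state comparison.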
Cheers,
Hannes