[PATCH RFC] nvme-fc: FPIN link integrity handling
Sagi Grimberg
sagi at grimberg.me
Fri Mar 8 02:38:44 PST 2024
On 07/03/2024 14:13, Hannes Reinecke wrote:
> On 3/7/24 13:01, Sagi Grimberg wrote:
>>
>>
>> On 07/03/2024 13:29, Hannes Reinecke wrote:
>>> On 3/7/24 11:10, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 19/02/2024 10:59, hare at kernel.org wrote:
>>>>> From: Hannes Reinecke <hare at suse.de>
>>>>>
>>>>> FPIN LI (link integrity) messages are received when the attached
>>>>> fabric detects hardware errors. In response to these messages the
>>>>> affected ports should not be used for I/O, and only put back into
>>>>> service once the ports had been reset as then the hardware might
>>>>> have been replaced.
>>>>
>>>> Does this mean it cannot service any type of communication over
>>>> the wire?
>>>>
>>> It means that the service is impacted, and communication cannot be
>>> guaranteed (CRC errors, packet loss, you name it).
>>> So the link should be taken out of service until it's been (manually)
>>> replaced.
>>
>> OK, that's what I assumed.
>>
>>>
>>>>> This patch adds a new controller flag 'NVME_CTRL_TRANSPORT_BLOCKED'
>>>>> which will be checked during multipath path selection, causing the
>>>>> path to be skipped.
>>>>
>>>> While this looks sensible to me, it also looks like this introduces
>>>> a ctrl state
>>>> outside of ctrl->state... Wouldn't it make sense to move the
>>>> controller to
>>>> NVME_CTRL_DEAD ? or is it not a terminal state?
>>>>
>>> Actually, I was trying to model it alongside the
>>> 'devloss_tmo'/'fast_io_fail' mechanism we have in SCSI.
>>> Technically the controller is still present, it's just that we
>>> shouldn't
>>> send I/O to it.
>>
>> Sounds like a dead controller to me.
>>
> Sort of, yes.
>
>>> And I'd rather not disconnect here as we're trying to
>>> do an autoconnect on FC, so manually disconnect would interfere with
>>> that and we probably end in a death spiral doing disconnect/reconnect.
>>
>> I suggested just transitioning the state to DEAD... Not sure how
>> keep-alives behave though...
>>
> Hmm. The state machine has the transition LIVE->DELETING->DEAD,
> ie a dead controller is on the way out, with all resources being
> reclaimed.
>
> A direct transition would pretty much violate that.
> If we were going that way I'd prefer to have another state
> ('IMPACTED' ? 'LIVE_NOIO' ?) with the transitions
> LIVE->IMPACTED->DELETING->DEAD
>
>>>
>>> We could 'elevate' it to a new controller state, but wasn't sure how
>>> big
>>> an appetite there is. And we already have flags like 'stopped' which
>>> seem to fall into the same category.
>>
>> stopped is different because it is not used to determine if it is
>> capable
>> for IO (admin or io queues). Hence it is ok to be a flag.
>>
> Okay.
>
> So yeah, we could introduce a new state, but I guess a direct transition
> to 'DEAD' is not really a good idea.
How common do you think this state would be? On the one hand, having a
generic
state that the transport is kept a live but simply refuses to accept I/O
sounds like
a generic state, but I can't think of an equivalent in the other transports.
If this is something that is private to FC, perhaps the right way is to
add a flag
for it that only fc sets, and when a second usage of it appears, we
promote it
to a proper controller state. Thoughts?
More information about the Linux-nvme
mailing list