[PATCH] nvme-tcp: delay error recovery after link drop

Sagi Grimberg sagi at grimberg.me
Thu Jul 14 09:07:05 PDT 2022


>>> When the connection unexpectedly closes we must not start error
>>> recovery right away, as the controller might not have registered
>>> the connection failure, and so retrying commands directly might
>>> lead to data corruption.
>>> So wait for KATO before starting error recovery to be on the safe
>>> side; chances are the commands will time out before that anyway.
>>
>> We can't just blindly add kato to the error recovery, because that
>> by definition creates an intrinsic delay for I/O failover. There is
>> absolutely no reason whatsoever to do that. kato can be arbitrarily
>> long.
>>
> Yes, true. But the controller might need to register the connection 
> timeout and do some cleanup action afterwards, for which KATO is the 
> only value we have currently.

Yes, but it is orthogonal to the problem you are describing. This is
tying together two unrelated things. Plus, kato is controlled by the
user, and now it affects the failover latency? kato can be 1s; is that
enough? No one knows... it's all voodoo.

>> If the controller needs this delay, then it should signal this
>> somehow in its capabilities. This is not something that we can just
>> blindly do. So this needs to go through the TWG.
>>
> ... as you might be aware, this is being discussed in the TWG.

I remember discussing this once, and I was against doing this
unconditionally. This is by definition a consequence of a particular
controller implementation. There are controllers out there that know
how to fence against the issue you are describing, and with those there
is no reason for the host to wait kato before failing over.

If the controller needs it, the controller needs to reflect
it to the host.

> There is even TPAR 4129, which deals with precisely this issue.
> But the consensus is that an immediate retry on path failure is dangerous, 
> as the controller might not have detected the path failure yet, and will 
> wait for at least KATO before declaring a path dead and starting recovery 
> actions, potentially clearing out stale/stuck commands.
> And retrying on another path during that time is dangerous.

Dangerous is unfortunately not enough to justify doing this universally.
If, for some reason that is unclear to me, there is resistance to having
the controller reflect what it needs, then add a failover_tmo parameter
that defaults to 0. For controllers that need a different timeout, the
user can then set it explicitly.
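
Something along these lines would do (just a sketch, untested: exposing
failover_tmo as a module parameter is only one option, it assumes
err_work is converted to a delayed_work, which is not the case today,
and the other names follow the existing driver):

/* hypothetical knob: 0 = immediate failover, i.e. today's behavior */
static unsigned int failover_tmo;
module_param(failover_tmo, uint, 0644);
MODULE_PARM_DESC(failover_tmo,
        "seconds to delay error recovery after connection loss (default: 0)");

static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
        if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
                return;

        dev_warn(ctrl->device, "starting error recovery\n");
        /* assumes err_work becomes a delayed_work */
        queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work,
                           failover_tmo * HZ);
}

A connect option (like ctrl_loss_tmo) would arguably be a better fit
than a module parameter, so the delay can be set per controller.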

> And yes, we do have customers who have seen this in real life.
> 
>> Also, how come this is TCP-specific anyway? This talks about a
>> controller that has a dangling inflight command that it cannot fence
>> yet, and hence cannot serve failover I/O. RDMA should have the same thing.
>>
> The crucial bit here is the 'nvme_tcp_state_change()' callback, which 
> will trigger recovery as soon as it detects a connection failure.
> 
> RDMA doesn't have a direct equivalent here, and I'm not deep enough into 
> the RDMA details to know whether RDMA_CM_EVENT_DISCONNECTED is synchronized 
> with the remote side. If it is, we're fine. If not, we'll have the same 
> issue.

RDMA_CM_EVENT_DISCONNECTED is unrelated to what you are describing; the
rdma_cm event you are referring to is RDMA_CM_EVENT_TIMEWAIT_EXIT (i.e.
the host sent a disconnect request and it timed out). At least that is
the case, afair.
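
For reference, the TCP-side state_change path referred to above looks
roughly like this (paraphrased and simplified from
drivers/nvme/host/tcp.c; chaining to the original socket callback is
omitted):

static void nvme_tcp_state_change(struct sock *sk)
{
        struct nvme_tcp_queue *queue;

        read_lock_bh(&sk->sk_callback_lock);
        queue = sk->sk_user_data;
        if (!queue)
                goto done;

        switch (sk->sk_state) {
        case TCP_CLOSE:
        case TCP_CLOSE_WAIT:
        case TCP_LAST_ACK:
        case TCP_FIN_WAIT1:
        case TCP_FIN_WAIT2:
                /* peer closed or link dropped: error recovery starts
                 * immediately, without waiting for kato */
                nvme_tcp_error_recovery(&queue->ctrl->ctrl);
                break;
        default:
                break;
        }
done:
        read_unlock_bh(&sk->sk_callback_lock);
}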


