[PATCH] nvme-tcp: delay error recovery after link drop

Hannes Reinecke hare at suse.de
Thu Jul 14 08:15:56 PDT 2022


On 7/14/22 16:42, Sagi Grimberg wrote:
> 
>> When the connection unexpectedly closes we must not start error
>> recovery right away, as the controller might not have registered
>> the connection failure yet, so retrying commands directly might
>> lead to data corruption.
>> So wait for KATO before starting error recovery to be on the safe
>> side; chances are the commands will time out before that anyway.
> 
> We can't just blindly add kato to the error recovery because that
> by definition creates an intrinsic delay for I/O failover. There is
> absolutely no reason whatsoever to do that. kato can be arbitrarily
> long.
> 
Yes, true. But the controller first has to register the connection
failure and do some cleanup actions afterwards, and KATO is the only
value we currently have to bound that interval.
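
Something along these lines is what I mean (just a sketch against the
current driver, assuming err_work gets converted to a delayed_work;
the actual patch may differ in detail):

/*
 * Sketch only: schedule error recovery after KATO instead of
 * immediately, so the controller gets a chance to notice the dead
 * connection and fence in-flight commands before we fail over.
 * Assumes err_work has been converted to a struct delayed_work.
 */
static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return;

	dev_warn(ctrl->device, "starting error recovery\n");
	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work,
			   ctrl->kato * HZ);
}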

> If the controller needs this delay, then it should signal this
> somehow in its capabilities. This is not something that we can just
> blindly do. So this needs to go through the TWG.
> 
... as you might be aware, this is discussed at the TWG.
There even is a TPAR 4129 which deals with precisely this issue.
But the consensus is that an immediate retry on path failure is
dangerous: the controller might not have detected the path failure
yet, and will wait for at least KATO before declaring the path dead
and starting recovery actions, potentially clearing out stale/stuck
commands. Retrying on another path during that window is therefore
unsafe.

And yes, we do have customers who have seen this in real life.

> Also, how come this is TCP specific anyway? This talks about a
> controller that has a dangling inflight command that it cannot fence
> yet, and hence cannot serve failover I/O. RDMA should have the same
> thing.
> 
The crucial bit here is the 'nvme_tcp_state_change()' callback, which 
will trigger recovery as soon as it detects a connection failure.
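
Roughly what that callback does today (condensed from
drivers/nvme/host/tcp.c; logging trimmed):

/*
 * Condensed socket state_change callback: any transition of the
 * socket into a closing TCP state triggers error recovery right
 * away, with no grace period for the controller side.
 */
static void nvme_tcp_state_change(struct sock *sk)
{
	struct nvme_tcp_queue *queue;

	read_lock_bh(&sk->sk_callback_lock);
	queue = sk->sk_user_data;
	if (!queue)
		goto done;

	switch (sk->sk_state) {
	case TCP_CLOSE:
	case TCP_CLOSE_WAIT:
	case TCP_LAST_ACK:
	case TCP_FIN_WAIT1:
	case TCP_FIN_WAIT2:
		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
		break;
	default:
		break;
	}

	queue->state_change(sk);
done:
	read_unlock_bh(&sk->sk_callback_lock);
}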

RDMA doesn't have a direct equivalent here, and I'm not deep enough
into the RDMA details to know whether RDMA_CM_EVENT_DISCONNECTED is
synchronized with the remote side. If it is, we're fine. If not,
we'll have the same issue.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


