[PATCH] nvme-tcp: delay error recovery after link drop

Wed Sep 7 00:57:20 PDT 2022

> I think the problem Hannes is trying to address with this patch is: the 
> Spec currently says:
> 
> Base 2.0 3.3.2.4
> 
> If an NVMe Transport connection is lost as a result of an NVMe Transport
> error, then before performing recovery actions related to commands sent
> on I/O queues associated with that NVMe Transport connection, the host
> should wait for at least the longer of:
> 
> - the NVMe Keep Alive timeout; or
> - the underlying fabric transport timeout, if any.
> 
> I'm not sure the NVMe/TCP host stack obeys this rule.
> 
> The problem is, when the host fails to follow this rule, some NVMe/TCP 
> controllers are found to corrupt data during cable pull tests.

The delay needs to come explicitly from the controller, not be guessed
blindly by the host.
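
For reference, the spec rule quoted above amounts to taking the larger of
two timeouts. A minimal sketch in C, assuming both values are already known
to the host in milliseconds and that 0 means "no fabric transport timeout".
recovery_delay_ms() and its parameters are hypothetical names used only for
illustration, not existing nvme-tcp code:

```c
/*
 * Hypothetical helper (not part of the nvme-tcp driver): minimum time to
 * wait before starting recovery for commands on a lost connection, i.e.
 * the longer of the Keep Alive timeout and the fabric transport timeout.
 *
 * Assumption: transport_timeout_ms == 0 means the transport defines no
 * timeout, in which case only the Keep Alive timeout applies.
 */
static unsigned int recovery_delay_ms(unsigned int kato_ms,
				      unsigned int transport_timeout_ms)
{
	return kato_ms > transport_timeout_ms ? kato_ms : transport_timeout_ms;
}
```

With a 5000 ms Keep Alive timeout and no transport timeout this yields
5000 ms; with a 30000 ms transport timeout it yields 30000 ms. Where those
two inputs come from is exactly the open question here.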