[PATCH] nvme-tcp: delay error recovery after link drop
Sagi Grimberg
sagi at grimberg.me
Wed Sep 7 00:57:20 PDT 2022
> I think this is the problem Hannes is trying to address with this patch.
> The spec currently says:
>
> NVMe Base Specification 2.0, section 3.3.2.4
>
> If an NVMe Transport connection is lost as a result of an NVMe Transport
> error, then before performing recovery actions related to commands sent
> on I/O queues associated with that NVMe Transport connection, the host
> should wait for at least the longer of:
>
> - the NVMe Keep Alive timeout; or
> - the underlying fabric transport timeout, if any.
>
> I'm not sure the NVMe/TCP host stack obeys this rule.
>
> The problem is, when the host fails to follow this rule, some NVMe/TCP
> controllers are found to corrupt data during cable pull tests.
The delay needs to come explicitly from the controller, not be guessed
blindly by the host.
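
For reference, the rule in the quoted spec text reduces to taking a maximum
of two timeouts. A minimal sketch in C, purely illustrative: the helper name
min_recovery_delay_ms() is hypothetical and is not an existing nvme-tcp
function, and it assumes both values are in milliseconds with a transport
timeout of 0 meaning "none". It only restates the spec wording; it does not
say where the values should come from.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the Linux nvme-tcp driver).
 *
 * Returns the minimum time, in milliseconds, the quoted spec text says
 * the host should wait before recovering commands on a lost connection:
 * the longer of the Keep Alive timeout and the fabric transport timeout.
 * Assumption: transport_timeout_ms == 0 means there is no such timeout.
 */
uint32_t min_recovery_delay_ms(uint32_t kato_ms, uint32_t transport_timeout_ms)
{
	return kato_ms > transport_timeout_ms ? kato_ms : transport_timeout_ms;
}
```

With a 5000 ms Keep Alive timeout and no transport timeout, this yields
5000 ms; if the transport timeout is longer than KATO, the transport
timeout wins.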