[PATCH 0/3] nvme-tcp: start error recovery after KATO
Hannes Reinecke
hare at suse.de
Tue Sep 12 04:56:13 PDT 2023
On 9/12/23 13:51, Sagi Grimberg wrote:
>> Hi all,
>>
>> there have been some very insistent reports of data corruption
>> with certain target implementations due to command retries.
>
> None of which were reported on this list...
>
Correct. I can ask them to post their findings here if that would make a
difference.
>> Problem here is that for TCP we're starting error recovery
>> immediately after either a command timeout or a (local) link loss.
>
> It does so only in one occasion, when the user triggered a
> reset_controller. a command timeout is greater than the default
> kato (6 times in fact), was this the case where the issue was
> observed? If so, the timeout handler should probably just wait
> the kato remaining time.
>
Nothing to do with reset_controller.
The problem really is in nvme_tcp_state_change(), which will blindly
start error recovery (and a subsequent command retry) whenever the local
link drops. And that error recovery does _not_ wait for KATO, but rather
causes an immediate retry.
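
To illustrate the delay in question, here is a minimal userspace sketch
(not the patch itself, and not kernel code). kato_remaining_ms() is a
hypothetical helper, and the idea of tracking the time since the last
successful keep-alive is an assumption made for the example: error
recovery would be deferred by whatever is left of the KATO period
instead of starting immediately.

```c
/*
 * Hypothetical helper, not an existing kernel function.
 *
 * kato_ms:          keep-alive timeout (KATO) in milliseconds
 * since_last_ka_ms: time since the last successful keep-alive, in
 *                   milliseconds (assumed to be tracked by the caller)
 *
 * Returns how long error recovery should be deferred so that it does
 * not start before KATO has expired.
 */
static unsigned long kato_remaining_ms(unsigned long kato_ms,
				       unsigned long since_last_ka_ms)
{
	/* KATO already expired: nothing left to wait for. */
	if (since_last_ka_ms >= kato_ms)
		return 0;

	return kato_ms - since_last_ka_ms;
}
```

With a KATO of 5000 ms (an example value) and a link drop 1200 ms after
the last keep-alive, this yields a 3800 ms delay before recovery; once
the KATO period has fully elapsed it yields 0.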
> BTW, the same happens for rdma as well. Nothing should be
> tcp specific here afaict.
Oh, sure. Will be modifying the patch to include that.
Cheers,
Hannes