[PATCH 0/3] nvme-tcp: start error recovery after KATO

Hannes Reinecke hare at suse.de
Tue Sep 12 04:56:13 PDT 2023


On 9/12/23 13:51, Sagi Grimberg wrote:
>> Hi all,
>>
>> there have been some very insistent reports of data corruption
>> with certain target implementations due to command retries.
> 
> None of which were reported on this list...
> 
Correct. I can ask them to post their findings here if that would make a
difference.

>> Problem here is that for TCP we're starting error recovery
>> immediately after either a command timeout or a (local) link loss.
> 
> It does so only in one occasion, when the user triggered a
> reset_controller. a command timeout is greater than the default
> kato (6 times in fact), was this the case where the issue was
> observed? If so, the timeout handler should probably just wait
> the kato remaining time.
> 
Nothing to do with reset_controller.
The problem really is in nvme_tcp_state_change(), which will blindly
start error recovery (and a subsequent command retry) whenever the local
link drops. And that error recovery does _not_ wait for KATO, but rather
causes an immediate retry.
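
To illustrate the idea (a userspace sketch only, not the actual patch;
recovery_delay_ms() and its parameters are made-up names): instead of
starting recovery right away, the link-drop path would hold off for
whatever is left of KATO since the last successful keep-alive, so that
retries only go out once KATO has expired.

```c
/*
 * Hypothetical helper, not a kernel function: how long to hold off
 * error recovery so that command retries start only after KATO has
 * elapsed since the last successful keep-alive.
 *
 * All times are in milliseconds. Unsigned subtraction keeps the
 * result correct if the clock wraps, in the same way jiffies
 * arithmetic does.
 */
static unsigned long recovery_delay_ms(unsigned long now_ms,
				       unsigned long last_ka_ms,
				       unsigned long kato_ms)
{
	unsigned long elapsed = now_ms - last_ka_ms;

	/* KATO already expired: recovery may start immediately. */
	if (elapsed >= kato_ms)
		return 0;

	/* Otherwise wait out the remainder of the KATO period. */
	return kato_ms - elapsed;
}
```

With a KATO of 5000 ms and a link drop 2000 ms after the last
keep-alive, this returns 3000, i.e. recovery would be deferred by three
seconds rather than started at once. How that delay gets applied in the
driver is left open here.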

> BTW, the same happens for rdma as well. Nothing should be
> tcp specific here afaict.

Oh, sure. Will be modifying the patch to include that.

Cheers,

Hannes



