[PATCH 0/3] nvme-tcp: start error recovery after KATO

Hannes Reinecke hare at suse.de
Tue Sep 12 06:25:23 PDT 2023


On 9/12/23 14:12, Sagi Grimberg wrote:
> 
>>>> Hi all,
>>>>
>>>> there have been some very insistent reports of data corruption
>>>> with certain target implementations due to command retries.
>>>
>>> None of which were reported on this list...
>>>
>> Correct. I can ask them to post their finding here if that would make 
>> a difference.
>>
>>>> Problem here is that for TCP we're starting error recovery
>>>> immediately after either a command timeout or a (local) link loss.
>>>
>>> It does so only in one occasion, when the user triggered a
>>> reset_controller. a command timeout is greater than the default
>>> kato (6 times in fact), was this the case where the issue was
>>> observed? If so, the timeout handler should probably just wait
>>> the kato remaining time.
>>>
>> Nothing to do with reset_controller.
>> The problem really is in nvme_tcp_state_change(), which will blindly 
>> start error recovery (and a subsequent command retry) whenever the 
>> local link drops.
> 
> That is incorrect. A link drop does not trigger a socket state change.
> a state change is triggered for specific socket states
> (see net/ipv4/tcp.c for description of socket states).
> 
> These are a set of states that describes local and remote sides
> shutting down (or a TIME_WAIT). The cases where the host
> triggers error recovery without the controller acknowledging it
> (in case the link is down) are FIN_WAIT1/2, and these happen
> when the user triggers a controller reset/disconnect.
> 
> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
> which should indicate the controller acknowledged the shutdown.
> 

Ah, now. It's really neither.
Complaint from the customer was that we are retrying commands without 
waiting for KATO. Actual claim was that we're retrying more-or-less 
immediately.

So, the problem here is not the first command running into a timeout;
the problem really is with the _subsequent_ commands running into a 
timeout (cf my comments to patch 1/3).
The first command will set the status to RESETTING and start error 
recovery, all fine. But all _other_ commands will be aborted directly,
causing them to be retried immediately without any delay.

>> And that error recover does _not_ wait for KATO, but rather
>> causes an immediate retry.
> 
> We can't blindly wait for KATO. If the controller saw the socket
> closure, there is no point waiting kato.
> 
Agreed. So we should be marking out FIN_WAIT1 as that one requiring us
to wait for KATO.

> If the user sets KATO to be 3 hours, and does a controller reset,
> what happens now?
> 
See above. I'd be fine with delaying error recovery for FIN_WAIT1 only.

> Only when the host triggers the shutdown, and that is not acknowledged
> by the controller, then yes, the host needs to wait the remaining of
> kato.

I'll be modifying the patchset to updating the description for the first
patch and only delaying error recovery for FIN_WAIT1.

Cheers,

Hannes




More information about the Linux-nvme mailing list