[PATCH 0/3] nvme-tcp: start error recovery after KATO

Mon Dec 4 02:30:45 PST 2023

>>>>>> Hi all,
>>>>>>
>>>>>> there have been some very insistent reports of data corruption
>>>>>> with certain target implementations due to command retries.
>>>>>
>>>>> None of which were reported on this list...
>>>>>
>>>> Correct. I can ask them to post their finding here if that would 
>>>> make a difference.
>>>>
>>>>>> Problem here is that for TCP we're starting error recovery
>>>>>> immediately after either a command timeout or a (local) link loss.
>>>>>
>>>>> It does so only in one occasion, when the user triggered a
>>>>> reset_controller. a command timeout is greater than the default
>>>>> kato (6 times in fact), was this the case where the issue was
>>>>> observed? If so, the timeout handler should probably just wait
>>>>> the kato remaining time.
>>>>>
>>>> Nothing to do with reset_controller.
>>>> The problem really is in nvme_tcp_state_change(), which will blindly 
>>>> start error recovery (and a subsequent command retry) whenever the 
>>>> local link drops.
>>>
>>> That is incorrect. A link drop does not trigger a socket state change.
>>> a state change is triggered for specific socket states
>>> (see net/ipv4/tcp.c for description of socket states).
>>>
>>> These are a set of states that describes local and remote sides
>>> shutting down (or a TIME_WAIT). The cases where the host
>>> triggers error recovery without the controller acknowledging it
>>> (in case the link is down) are FIN_WAIT1/2, and these happen
>>> when the user triggers a controller reset/disconnect.
>>>
>>> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
>>> which should indicate the controller acknowledged the shutdown.
>>>
>>
>> Ah, now. It's really neither.
>> Complaint from the customer was that we are retrying commands without 
>> waiting for KATO. Actual claim was that we're retrying more-or-less 
>> immediately.
>>
>> So, the problem here is not the first command running into a timeout;
>> the problem really is with the _subsequent_ commands running into a 
>> timeout (cf my comments to patch 1/3).
>> The first command will set the status to RESETTING and start error 
>> recovery, all fine. But all _other_ commands will be aborted directly,
>> causing them to be retried immediately without any delay.
> 
> This is still unclear to me.
> The default io timeout is 30s, which is 6 times larger than the
> default kato. So effectively we are hitting a case where we waited
> 6 kato periods, and with your patch we are waiting 7 kato periods?
> 
> There is a disconnect for me here.
> 
> Can you please clarify:
> 1. what is your kato?
> 2. what is your io timeout?
> 3. was the controller RESETTING state triggered from kato? or
>     from io timeout?
>     if the latter, it indicates that kato is higher than io timeout,
>     otherwise the transport is perfectly fine, and this is essentially
>     a controller io that is taking longer than the host io timeout?
> 
> Do you have some trace that points out how a command is sent down to
> the controller (successfully arrives to the controller) the network goes
> down, and the io is retried before kato expires?
> 
> What is unclear to me here, is how kato expires, but the controller
> continues to process commands.
> 
>>>> And that error recover does _not_ wait for KATO, but rather
>>>> causes an immediate retry.
>>>
>>> We can't blindly wait for KATO. If the controller saw the socket
>>> closure, there is no point waiting kato.
>>>
>> Agreed. So we should be marking out FIN_WAIT1 as that one requiring us
>> to wait for KATO.
> 
> You can't do that, that will wait on FIN_WAIT1 for every queue.

Hey Hannes,

Can you comment where this went?

I'd like us to see how to progress to resolve this (assuming this
is still an issue).