[PATCH 0/3] nvme-tcp: start error recovery after KATO

Hannes Reinecke hare at suse.de
Mon Dec 4 06:15:23 PST 2023


On 12/4/23 14:27, Sagi Grimberg wrote:
> 
> 
> On 12/4/23 13:37, Hannes Reinecke wrote:
>> On 12/4/23 11:30, Sagi Grimberg wrote:
>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> there have been some very insistent reports of data corruption
>>>>>>>>> with certain target implementations due to command retries.
>>>>>>>>
>>>>>>>> None of which were reported on this list...
>>>>>>>>
>>>>>>> Correct. I can ask them to post their findings here if that would
>>>>>>> make a difference.
>>>>>>>
>>>>>>>>> Problem here is that for TCP we're starting error recovery
>>>>>>>>> immediately after either a command timeout or a (local) link loss.
>>>>>>>>
>>>>>>>> It does so only on one occasion: when the user triggers a
>>>>>>>> reset_controller. The default command timeout is greater than the
>>>>>>>> default kato (6 times greater, in fact); was this the case where
>>>>>>>> the issue was observed? If so, the timeout handler should probably
>>>>>>>> just wait out the remaining kato time.
>>>>>>>>
>>>>>>> Nothing to do with reset_controller.
>>>>>>> The problem really is in nvme_tcp_state_change(), which will 
>>>>>>> blindly start error recovery (and a subsequent command retry) 
>>>>>>> whenever the local link drops.
>>>>>>
>>>>>> That is incorrect. A link drop does not trigger a socket state
>>>>>> change. A state change is triggered only for specific socket states
>>>>>> (see net/ipv4/tcp.c for a description of the socket states).
>>>>>>
>>>>>> These are a set of states that describe local and remote sides
>>>>>> shutting down (or a TIME_WAIT). The cases where the host
>>>>>> triggers error recovery without the controller acknowledging it
>>>>>> (in case the link is down) are FIN_WAIT1/2, and these happen
>>>>>> when the user triggers a controller reset/disconnect.
>>>>>>
>>>>>> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
>>>>>> which should indicate the controller acknowledged the shutdown.
>>>>>>
>>>>>
>>>>> Ah, no. It's really neither.
>>>>> The complaint from the customer was that we are retrying commands
>>>>> without waiting for KATO; the actual claim was that we're retrying
>>>>> more-or-less immediately.
>>>>>
>>>>> So the problem here is not the first command running into a timeout;
>>>>> the problem really is with the _subsequent_ commands running into a
>>>>> timeout (cf. my comments on patch 1/3).
>>>>> The first command will set the controller state to RESETTING and
>>>>> start error recovery, all fine. But all _other_ commands will be
>>>>> aborted directly, causing them to be retried immediately without
>>>>> any delay.
>>>>
>>>> This is still unclear to me.
>>>> The default io timeout is 30s, which is 6 times larger than the
>>>> default kato. So effectively we are hitting a case where we waited
>>>> 6 kato periods, and with your patch we are waiting 7 kato periods?
>>>>
>>>> There is a disconnect for me here.
>>>>
>>>> Can you please clarify:
>>>> 1. what is your kato?
>>>> 2. what is your io timeout?
>>>> 3. was the controller RESETTING state triggered from kato? or
>>>>     from io timeout?
>>>>     if the latter, it indicates that kato is higher than io timeout,
>>>>     otherwise the transport is perfectly fine, and this is essentially
>>>>     a controller io that is taking longer than the host io timeout?
>>>>
>>>> Do you have some trace that shows how a command is sent down to
>>>> the controller (and successfully arrives at the controller), the
>>>> network goes down, and the io is retried before kato expires?
>>>>
>>>> What is unclear to me here, is how kato expires, but the controller
>>>> continues to process commands.
>>>>
>>
>> The argument here is that there is a race window between the command
>> timeout firing (i.e. the call to nvme_tcp_timeout()) and the scheduling
>> of the 'err_work' workqueue function. If you have a burst of commands
>> all expiring at roughly the same time, the first command will set the
>> controller state to 'RESETTING' and schedule the 'err_work' workqueue
>> function.
> 
> Yes, that is correct. It is also the keep-alive itself that can
> trigger the timeout handler. In fact, it is the more likely one
> because it will by default have a lower timeout than IO commands.
> 
> This is a part of where I lose the trail here, because I'd expect
> the keep-alive to expire far before any of the IO commands, especially
> with the latest changes to expedite the keep-alive execution pace.
> 
That's not necessarily the case. We don't have any checks to align
command timeouts with KATO, and we don't mandate anywhere that command
timeouts need to be larger than KATO.
And even then, we can get pathological cases where a command takes up
to the full timeout to be processed, and the last keep-alive was sent
just before that.
So I'd rather prefer to fix the race condition.

>> The second command will time out _before_ err_work has run, will see
>> the RESETTING state in nvme_tcp_timeout(), and will complete and retry
>> directly (via nvme_tcp_complete_timed_out()) without waiting for KATO.
> 
> That is something we can probably address by analyzing the controller
> state a bit deeper. We've always failed IO commands quickly in the
> timeout handler because these commands can be part of the ctrl setup
> sequence, which means the error recovery will not reliably run and
> fail the commands on their behalf. I also think there were issues in
> the teardown stage (shutdown or disable).
> 
Sure. That's what I tried to address with the first patch.
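
For reference, this is roughly the path in question (a simplified
sketch from memory, not verbatim drivers/nvme/host/tcp.c; the helper
names are from tcp.c as I remember them, the body is paraphrased):

/* Simplified sketch of the timeout path; not the actual kernel code. */
static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

	if (ctrl->state != NVME_CTRL_LIVE) {
		/*
		 * Every command timing out after the first one lands
		 * here: the first timeout already moved the controller
		 * to RESETTING, so this command is completed and
		 * retried right away, without any KATO delay.
		 */
		nvme_tcp_complete_timed_out(rq);
		return BLK_EH_DONE;
	}

	/*
	 * First timeout: move to RESETTING and schedule err_work,
	 * which tears down the queues and requeues the commands.
	 */
	nvme_tcp_error_recovery(ctrl);
	return BLK_EH_RESET_TIMER;
}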

>> That is the issue the customer had been seeing, and that is what is
>> causing data corruption on their end.
>>
>> This patchset is an attempt to fix this issue.
> 
> In your patch 3, you take nvme_keep_alive_work_period(), which is
> either kato/2 or kato/4. It is not clear to me why these values
> were chosen other than "the issue went away".
> 
Yeah, that is wrong. The idea is to ensure that a command timeout
_always_ starts error recovery only after the KATO period has expired,
but that an actual keep-alive timeout schedules the error handling
directly (ie flushing the scheduled error recovery workqueue item).
That way KATO is the upper limit, but error recovery can start earlier
depending on the timing between command and keep-alive timeouts.
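
To make that concrete, the direction I have in mind is roughly the
following (a sketch of the intent only, not the actual patch; it
assumes err_work is converted to a delayed_work and that the error
recovery helper learns whether the keep-alive itself expired):

static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl,
				    bool kato_expired)
{
	struct nvme_tcp_ctrl *tcp_ctrl = to_tcp_ctrl(ctrl);

	/*
	 * Deliberately ignore the return value so that a later
	 * keep-alive timeout can still expedite a pending recovery.
	 */
	nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING);

	if (kato_expired) {
		/*
		 * Keep-alive failure: the controller really is gone,
		 * pull any pending recovery forward and run it now.
		 */
		mod_delayed_work(nvme_reset_wq, &tcp_ctrl->err_work, 0);
	} else {
		/*
		 * Command timeout: arm recovery, but only once KATO
		 * has passed; a no-op if recovery is already pending.
		 */
		queue_delayed_work(nvme_reset_wq, &tcp_ctrl->err_work,
				   ctrl->kato * HZ);
	}
}

A later command timeout won't push an already-armed recovery further
out (queue_delayed_work() does nothing while the work is pending),
while a keep-alive timeout always pulls it forward to run immediately.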

> Also, what happens if kato (which is user configurable) is arbitrarily
> long? This means that IO failover can take, say, 60 minutes? Do you
> think it's reasonable? This is where I'm concerned about the
> unconditional delay.
> 
Yes, unfortunately that's true. But that's what's mandated by the spec.
We could add an option to stick to the original behaviour, though, as 
one could argue that this is what users are expecting now.
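
If we go down that route it could be as simple as a module parameter
(purely hypothetical, the name is made up; not part of this series):

/* Hypothetical opt-out knob, not part of the posted patches. */
static bool err_recovery_wait_kato = true;
module_param(err_recovery_wait_kato, bool, 0644);
MODULE_PARM_DESC(err_recovery_wait_kato,
		 "delay error recovery until KATO has expired (default: on)");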

> I think that the idea was that an appropriate delay would be indicated
> by the controller, rather than being determined by a user controlled
> variable.
> 
Yep. The standard will be updated to cover that, but we're not there yet.

> Regardless, we should limit this to exactly the cases where such a
> delay is actually needed.
> 
> Plus, I haven't seen anything for nvme-rdma, and it is unclear to me
> why this is specific to the tcp host driver.
> 
That's because the testing had been done on TCP only, and I wanted to
resolve the discussion before posting a full patchset.

>> We'd be happy to pick up this thread; it just had been pushed back
>> internally as we put a band-aid into our kernels for that customer...
> 
> Well, my assumption is that we want this addressed upstream. Just a
> matter of how.

Agreed.

Cheers,

Hannes