[PATCH 0/3] nvme-tcp: start error recovery after KATO

Sagi Grimberg sagi at grimberg.me
Mon Dec 4 08:11:29 PST 2023


>>> Argument here is that there is a race window between command timeout
>>> triggering (i.e. the call to nvme_tcp_timeout()) and scheduling of the
>>> 'err_work' workqueue function. If you have a burst of commands all
>>> expiring at roughly the same time, the first command will set the
>>> controller state to 'RESETTING' and schedule the 'err_work' workqueue
>>> function.
>>
>> Yes, that is correct. The keep-alive itself can also trigger the
>> timeout handler. In fact, it is the more likely trigger because it
>> has a lower timeout than IO commands by default.
>>
>> This is part of where I lose the trail here, because I'd expect
>> the keep-alive to expire far before any of the IO commands, especially
>> with the latest changes to expedite the keep-alive execution pace.
>>
> That's not necessarily the case. We don't have any checks to align 
> command timeouts to KATO, and we don't mandate anywhere that command
> timeouts need to be larger than KATO.

Was this the case here, i.e. was the io timeout lower than kato?
What were these values in the reproducer?
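
For reference, the defaults as I remember them (the values below are from
memory, so treat them as assumptions rather than the exact upstream
definitions). Nothing couples the command timeouts to kato; they are
independent module parameters / connect options:

	unsigned int nvme_io_timeout = 30;	/* nvme_core.io_timeout, seconds */
	unsigned int admin_timeout = 60;	/* nvme_core.admin_timeout, seconds */
	#define NVME_DEFAULT_KATO 5		/* seconds, overridable at connect time */

With stock settings the keep-alive should be the first thing to expire.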

> And even then we can get pathological cases where commands take up
> to the timeout for processing, and the last keep-alive was just sent
> before that.
> So I'd rather fix the race condition.

My assumption was that the connection (transport) was torn down, not that
the controller's processing time exceeded the command timeout.

>>> The second command will time out _before_ err_work is scheduled,
>>> will see the RESETTING state in nvme_tcp_timeout(), and directly
>>> complete and retry (via nvme_tcp_complete_timed_out()) without
>>> waiting for KATO.
>>
>> That is something we probably can address by analyzing the controller
>> state a bit deeper. We've always failed IO commands quickly in the
>> timeout handler because these commands can be part of the ctrl setup
>> sequence, which means error recovery will not reliably run and fail
>> the commands on their behalf. I also think there were similar issues
>> in the teardown stage (shutdown or disable).
>>
> Sure. That's what I tried to address with the first patch.

Don't think we need to look at the running state of err_work (looks
racy to me). Perhaps the ctrl state alone can suffice, but we also
need to check that it is an IO command, so that the setup/teardown
admin commands can still fail quickly without relying on the error
handler.
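
Something along these lines is what I have in mind (just a sketch of how
I read it, not a replacement for your patch; the admin-vs-IO distinction
via nvme_tcp_queue_id() is my assumption of how to express it, and it
probably also needs to catch connect commands issued on the IO queues):

static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

	if (ctrl->state != NVME_CTRL_LIVE &&
	    nvme_tcp_queue_id(req->queue) == 0) {
		/*
		 * Admin command during setup or teardown: error recovery
		 * will not run on its behalf, so complete and retry it here.
		 */
		nvme_tcp_complete_timed_out(rq);
		return BLK_EH_DONE;
	}

	/*
	 * IO command (or a live controller): kick error recovery and let it
	 * tear down the queues and fail over the command, instead of
	 * completing it before KATO has expired.
	 */
	nvme_tcp_error_recovery(ctrl);
	return BLK_EH_RESET_TIMER;
}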

But overall I'm fine with this patch (1/3).

>>> That is the issue the customer had been seeing, and that is what is
>>> causing data corruption on their end.
>>>
>>> This patchset is an attempt to fix this issue.
>>
>> In your patch 3, you take nvme_keep_alive_work_period(), which is
>> either kato/2 or kato/4. It is not clear to me why these values
>> were chosen other than "the issue went away".
>>
> Yeah, that is wrong. The idea is to ensure that a command timeout
> _always_ starts error recovery no earlier than a full KATO period, but
> that an actual KATO timeout schedules the error handling directly (i.e.
> flushes the scheduled error recovery workqueue item).
> That way the KATO timeout is the upper limit, but error recovery can be
> started earlier depending on the timing between the command and
> keep-alive timeouts.

So the correct patch would be to delay the error handler by kato?
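
If so, I'd picture it roughly like this (sketch only; it assumes err_work
is converted to a delayed_work and that error recovery grows a 'defer'
argument, neither of which is in the posted patches):

static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl, bool defer)
{
	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		dev_warn(ctrl->device, "starting error recovery\n");

	if (defer)
		/* command timeout: give the controller a full kato to answer */
		queue_delayed_work(nvme_reset_wq,
				   &to_tcp_ctrl(ctrl)->err_work,
				   ctrl->kato * HZ);
	else
		/* keep-alive timeout: run recovery now, shortening any delay */
		mod_delayed_work(nvme_reset_wq,
				 &to_tcp_ctrl(ctrl)->err_work, 0);
}

i.e. a command timeout arms recovery at most kato into the future, and a
keep-alive timeout pulls it forward. Is that the intention?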

>> Also, what happens if kato (which is user configurable) is arbitrarily
>> long? This means that IO failover can take, say, 60 minutes? Do you
>> think that's reasonable? This is where I'm concerned about the
>> unconditional delay.
>>
> Yes, unfortunately that's true. But that's what's mandated by the spec.
> We could add an option to stick to the original behaviour, though, as 
> one could argue that this is what users are expecting now.

The only way kato can serve as the added delay is if we prevent or
limit what users can set it to. That is a discussion others should
comment on.

>> I think that the idea was that an appropriate delay would be indicated
>> by the controller, rather than being determined by a user controlled
>> variable.
>>
> Yep. The standard will be updated to cover that, but we're not there yet.
> 
>> Regardless, we should limit this to exactly the cases where such a
>> delay is actually needed.
>>
>> Plus, I haven't seen anything for nvme-rdma, and it is unclear to me
>> why this is specific to the tcp host.
>>
> That's because the testing had been done on TCP only, and I wanted to
> resolve the discussion before posting a full patchset.

I see.

>>> We'd be happy to pick up this thread; it had just been pushed back
>>> internally as we put in a band-aid for that customer in our kernels...
>>
>> Well, my assumption is that we want this addressed upstream. Just a
>> matter of how.
> 
> Agreed.
> 
> Cheers,
> 
> Hannes
> 


