[LSF/MM/BPF TOPIC] TP4129 KATO corrections and clarification
John Meneghini
jmeneghi at redhat.com
Fri Feb 7 09:14:10 PST 2025
On 2/7/25 4:00 AM, Sagi Grimberg wrote:
>> There was quite a bit of work to go through scenarios, and timer processing, to
>> address the concerns that the timer was a "best effort" (to reuse the term)
>> rather than a reliable method. I'm not aware of a scenario where applying the
>> timers, with the process described in TP4129, fails to provide a reliable
>> assurance that it is safe to retry write commands.
>
> Unless the controller is able to reliably upper bound its abort operation of all inflight
> commands I can't see how this is 100% reliable. Which will likely mean that a healthy
> margin is taken such that it will not break in practice.
Who said the controller is not able to reliably abort the operations of all inflight commands?
TP4129 doesn't say this.
The whole point of TP4129 is that it provides something that is missing from the KATO protocol.
What's missing is exactly what you are asking about: how much time does the controller
need to reliably abort all inflight commands?
KATO is a heartbeat. It is an error detection mechanism. KATO defines how long the controller
needs to wait before making the decision to BEGIN the process of internally aborting all inflight
IO. The problem is, the current KATO protocol does not define how long the controller takes to
FINISH the process of internally aborting all inflight IO. That time is controller-specific,
and it is what CQT was designed to communicate.
The current KATO protocol is broken because it confuses the error detection time with the error
recovery time. The idea that a controller "should" have run down all inflight IO by the time
KATO expiry is detected is a completely arbitrary assumption. It is an assumption in the KATO
specification, and it is broken. TP4129 addresses this assumption. It improves and corrects
the KATO protocol by providing a mechanism, the CQT, for the controller to communicate its
controller-specific error recovery time to the host.
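To make the distinction concrete, here is a minimal sketch of the host-side retry decision, before and after CQT. This is not TP4129's actual wording; the function and variable names (last_response_s, kato_s, cqt_s) are hypothetical, chosen only to illustrate the detection-vs-recovery split described above.

```python
from typing import Optional

def earliest_safe_write_retry(last_response_s: float,
                              kato_s: float,
                              cqt_s: Optional[float]) -> float:
    """Return the earliest time (in seconds) a write replay could be safe.

    KATO only bounds error *detection*: once KATO expires without a
    keep-alive, the controller BEGINS aborting inflight IO.  It says
    nothing about when the abort FINISHES.  CQT closes that gap by
    advertising how long the controller needs to quiesce everything.
    """
    if cqt_s is None:
        # Old, broken assumption: inflight IO is presumed quiesced the
        # moment KATO detection fires -- the arbitrary assumption the
        # KATO specification bakes in.
        return last_response_s + kato_s
    # Corrected model: detection time plus the controller-advertised
    # error recovery (quiesce) time.
    return last_response_s + kato_s + cqt_s
```

With a 10s KATO and a 5s CQT, the corrected model pushes the safe replay point from 10s out to 15s after the last response; the 5s difference is exactly the window where the old protocol could replay a write the controller was still executing.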
As to whether or not the CQT is reliable... that's not our concern. We have this same issue in
other parts of the protocol. For example, the ANA ATT. The ANA Transition Time communicates
a Service Level Agreement to the host, and if the host sees that the ANA state does not
finish transitioning within the ATT, the results can be catastrophic. But in that case,
the problem is in the controller, not the host.
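The ANA ATT check above can be sketched the same way. This is an illustrative fragment, not real host code; the names (transition_start_s, anatt_s) are assumptions, standing in for the ANATT value a host would read from Identify Controller.

```python
def ana_transition_violated(transition_start_s: float,
                            now_s: float,
                            anatt_s: float) -> bool:
    """True if the controller blew its advertised transition-time SLA.

    If an ANA group is still in a transitional state more than ANATT
    seconds after the transition began, the controller failed the SLA
    it advertised -- the fault is on the controller side, which is
    exactly the framing the text applies to CQT.
    """
    return (now_s - transition_start_s) > anatt_s
```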
The same thing is true for the CQT. CQT is a service level agreement. So, if the host replays
an IO after CQT and the controller screws up and corrupts data... that's not the host's fault. The problem
is the controller. But without CQT we can't say this; the host will always be suspect.
Without a clear protocol that provides a clear SLA, we can't hold things together.
Right now we have NVMe/TCP controllers out there that experience data corruption during their cable
pull tests because of this problem. These are shipping products that are supported by VMware and
Microsoft, etc., however, they are not supported by Linux. This problem was root caused and fixed by
patches[1] that Hannes proposed back in 2023.
[1] https://lore.kernel.org/linux-nvme/20230908100049.80809-1-hare@suse.de/
Sagi, you NAKed these patches. At that time both Hannes and I talked with you about this
and we agreed that we would wait for TP4129 to address this problem.
Since that time some distros have actually shipped these patches, out of tree, and I
am told by my storage partners that these patches fixed their data corruption problem.
In fact, Red Hat has been asked several times to include these patches in RHEL, but we've
been pushing back on our storage partners and telling them to wait for TP4129.
So to be told now that TP4129 is bogus and won't be implemented is more than a little maddening.
At the end of the day, I don't care about TP4129. Linux needs to fix the problem. Red Hat is
seeing NVMe/TCP adoption begin to pick up. More and more of our customers are trying NVMe/TCP
and using NVMe/TCP. If we want NVMe/TCP adoption to continue, we need this data corruption
problem to be fixed upstream. If you don't like TP4129, then ACK Hannes's patches.
/John