[LSF/MM/BPF TOPIC] TP4129 KATO corrections and clarification

Tue Feb 4 16:18:22 PST 2025

On 1/31/25 4:32 AM, Christoph Hellwig wrote:
> On Fri, Jan 31, 2025 at 10:24:31AM +0100, Daniel Wagner wrote:
>> Hi,
>>
>> The KATO handling in the spec got updated via the TP4129. In short it
>> mandates that failing requests should be delayed by some value before
>> retried on a different path.
> 
> TP4129 is broken and the TWG has been told that.  There is absolutely no
> point in either implementing it or rehashing the discussion over and

If by saying "TP4129 is broken" you mean that it is another timer-based mechanism, I think you need to get over it.
We are talking about a fabric. And in any fabric the error recovery mechanism of last resort will ALWAYS be a timeout.

And if by saying "the TWG has been told that" you mean that you've expressed YOUR opinion that TP8028 is a better
solution than TP4129 because it involves no additional timer, I think you need to get over it again. We all get your
opinion but that opinion can't dictate the volition of the industry.  TP4129 exists because there is code running
out there in controllers that fails to interoperate with Linux.  This TP is not some new feature like NVMe/TLS or FDP.
TP4129 is a repair to a broken protocol, the KATO protocol.

> over again.  Concentrate on making 8028 useful instead.  And if you want
> a discussion, "why do people push stupid things through the
> NVMe TWG and then complain later about the lack of implementations"
> might be more useful.

I'm sorry but you are going to have to give us a little more than insults and name calling to
convince me that TP4129 is a piece of crap and not worth implementing.

How is it, exactly, that TP4129 is broken?

Before TP4129 the spec said nothing about controller requirements following a timeout, and the Linux host code assumed
it was safe to replay I/Os immediately after a keep alive timeout period. This problem is exacerbated by the fact that
KATO is a negotiable parameter that can be changed by the user, on the command line, without regard to any requirement
of the controller.

For example, from the v2.0b spec:

   If a Keep Alive Timer expires:

   a) the controller shall:
      * record an Error Information Log Entry with the status code Keep Alive Timeout Expired,
      * stop processing commands;
      * set the Controller Fatal Status (CSTS.CFS) bit to ‘1’; and
        for message-based NVMe Transports:
        - terminate the NVMe Transport connection; and
        - break the host to controller association;

   and

   b) the host assumes all outstanding commands are not completed and re-issues
      commands as appropriate.

This is one place where the current KATO spec is inadequate. Unlike the FCP standard or the FC-NVMe specification, the NVMe 2.0
spec provides no guidance whatsoever about how or when commands should be replayed following a keep alive timeout. This was a
major defect in the NVMe specification that led to assumptions in both the host and controller implementations we have out there.
Those assumptions led to running code in shipping products that fails to interoperate, and this is what TP4129 addresses.

For the uninitiated, simply put, TP4129 expands point b) above by providing an optional controller-based Command Quiesce
Time (CQT), which advises the host how long to wait before replaying any command assumed not to be completed following a
keep alive timeout.
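
To make the idea concrete, here is a minimal sketch in Python (not kernel code) of the host-side decision. The names
ControllerInfo, cqt_ms, earliest_replay_ms and may_replay are hypothetical, and two things are assumptions on my part
rather than quotes from the TP: that the CQT is expressed in milliseconds, and that the delay is counted from the moment
the host detects the keep alive timeout.

```python
from dataclasses import dataclass


@dataclass
class ControllerInfo:
    # Hypothetical host-side record of what the controller advertised.
    # Assumed unit: milliseconds. 0 means the controller reported no CQT.
    cqt_ms: int = 0


def earliest_replay_ms(timeout_detected_ms: int, ctrl: ControllerInfo) -> int:
    """Earliest time (ms) at which outstanding commands may be replayed.

    Assumption: the quiesce delay starts when the host detects the
    keep alive timeout.
    """
    return timeout_detected_ms + ctrl.cqt_ms


def may_replay(now_ms: int, timeout_detected_ms: int, ctrl: ControllerInfo) -> bool:
    """True once the advertised quiesce time has elapsed."""
    return now_ms >= earliest_replay_ms(timeout_detected_ms, ctrl)
```

With cqt_ms left at 0 the host behaves exactly as it does today and replays immediately; with a non-zero value the
replay is simply held back until the controller has had the time it asked for.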

This timer is needed because different storage arrays require different additional delay periods to abort
commands that are internally in flight following a keep alive timeout. At Red Hat, multiple storage array vendors have reported
that, while testing the keep alive timeout mechanism, they are experiencing data corruption. These data corruption issues have been
tracked down and root-caused by more than one distribution to ghost writes inside the storage array following
a command that was replayed by the host during a KATO event.

This has been discussed multiple times in the past on the email list. For example here:

https://lore.kernel.org/linux-nvme/1cea7a75-8397-126c-8a2e-8e08948237b1@grimberg.me/

In many conversations about this on the email list and at LSF/MM in the past, it was agreed that we would
solve this problem by modifying the NVMe Specification.  This is what TP4129 is. We wrote TP4129 specifically to
address this ghost write problem and we did so by adding this optional CQT timer mechanism. This optional CQT timer
was designed to have the lowest possible impact on the existing KATO mechanism. Only storage arrays which need the
additional time following a timeout would need to fill in the CQT (which is a new field in the Identify Controller
Data Structure).  This allows the host to support storage arrays that don't need the additional timer, with no
impact, and also solves the problem for storage arrays that are currently broken because they do need an additional
timer. That is what TP4129 did.

As for TP8028, this undisclosed technical proposal will not replace TP4129. Everyone involved in the development
of these technical proposals at NVMexpress.org/FMDS has agreed that TP8028 will be an enhancement of TP4129 and that many
storage array vendors may not implement it.

Red Hat is currently involved in implementing and testing changes to support TP4129 in collaboration with those vendors.
If it can be proven that those changes don't work, or that the CQT timer is not needed, I can agree with Christoph that
support for TP4129 may not be needed in the upstream Linux host.  However, in the absence of any kind of empirical data
that says a CQT timer is not needed, I can't support any arbitrary decision that says we are not going to implement TP4129
or even talk about it. I plan to attend LSF/MM this year and one of my proposed topics is this one. Too many people have
been working on this for too long to say that we are not going to talk about it.

Right now the Linux KATO implementation is suffering from all kinds of problems. It needs to be improved and repaired. I
see no reason why, as a part of that process, support for a CQT timer cannot or should not be implemented and tested in the Linux
host. Either way, the NVMe keep alive timeout mechanism needs to be there, and it remains the error recovery mechanism
of last resort. We need to continue to test and improve the KATO mechanism whether TP4129 is implemented or not.

/John