[nvme_tcp] possible data corruption when host detected path error and resubmit IOs on another path

John Meneghini jmeneghi at redhat.com
Thu Feb 6 08:13:17 PST 2025


Hi Jiewei.

There have been issues reported on this list[1] where some NVMe/TCP attached storage arrays
experienced data corruption during error insertion testing. That corruption was attributed to
ghost writes in the storage array caused by commands being replayed by the host during Keep
Alive Timeout testing. TP4129 was developed to address those issues by introducing a Command
Quiescence Time (CQT) timer to the protocol. TP4129 is now part of the NVMe 2.1 specification.
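
To make that concrete, the behavior TP4129 asks of the host is essentially "wait out the old
controller before the same data can be rewritten on another path". A purely hypothetical sketch
of what that could look like in the driver is below; none of the fields or helpers shown
(cqt_ms, comm_loss_jiffies, nvme_wait_cqt_before_failover) exist in the current code, and the
CQT units and plumbing are glossed over:

/*
 * Hypothetical sketch only: delay failover until the controller's
 * Command Quiescence Time has expired, so any commands still executing
 * on the lost path have either completed or been dropped before their
 * LBAs can be rewritten via another path.
 */
static void nvme_wait_cqt_before_failover(struct nvme_ctrl *ctrl)
{
        /* cqt_ms: CQT reported by the controller, converted to milliseconds */
        unsigned long deadline = ctrl->comm_loss_jiffies +
                                 msecs_to_jiffies(ctrl->cqt_ms);

        while (time_before(jiffies, deadline))
                msleep(jiffies_to_msecs(deadline - jiffies));

        /* only now is it safe to requeue the cancelled requests elsewhere */
}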

My question is: did you find this problem through some kind of testing, or is this something
you've seen during code inspection?  Are you reporting this because you've experienced data
corruption with your NVMe/TCP controller?

If you have a test of some kind that allows easy reproduction of this problem, it would be
helpful.

/John

[1] https://lore.kernel.org/linux-nvme/1cea7a75-8397-126c-8a2e-8e08948237b1@grimberg.me/

On 12/30/24 9:32 PM, Jiewei Ke wrote:
> Hi experts,
> 
> When the NVMe host driver detects a command timeout (either admin or I/O
> command), it triggers error recovery and attempts to resubmit the I/Os on
> another path, provided that native multipath is enabled.
> 
> Taking the nvme_tcp driver as an example, the nvme_tcp_timeout function is
> called when any command times out:
> 
> nvme_tcp_timeout
>    -> nvme_tcp_error_recovery
>      -> nvme_tcp_error_recovery_work
>        -> nvme_tcp_teardown_io_queues
>          -> nvme_cancel_tagset
> 
> nvme_cancel_tagset completes the inflight requests on the failed path and
> then calls nvme_failover_req to resubmit them on a different path. There is
> no wait time before the I/O is resubmitted. This means that the controller
> on the old path may not have fully cleaned up the pending requests,
> potentially leading to data corruption on the NVMe namespace.
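
Right - and note that nothing in that sequence waits for the old controller: the cancel path
simply completes each in-flight request locally with a host-side status, and the normal
completion path then fails it over. A heavily simplified sketch of the flow is below; the
helper names here are illustrative, not the real functions (see drivers/nvme/host/core.c and
multipath.c for the actual code):

/*
 * Heavily simplified: cancel everything in flight on the failed path and
 * let the completion path requeue it elsewhere.  Locking, reserved
 * commands and error handling are all omitted.
 */
static bool cancel_one_request(struct request *req, void *data)
{
        /* complete the request locally with a host-side error status */
        nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
        blk_mq_complete_request(req);
        return true;
}

static void teardown_and_failover(struct nvme_ctrl *ctrl)
{
        /* walk every in-flight request on the failed path */
        blk_mq_tagset_busy_iter(ctrl->tagset, cancel_one_request, ctrl);

        /*
         * Each completion runs through nvme_complete_rq(), which sees a
         * path error and hands the request to nvme_failover_req(), so it
         * is requeued on another path immediately - without waiting for
         * the old controller to actually abort or finish the command.
         */
}
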
> 
> For example, consider the following scenario:
> 
> 1. The host sends IO1 to path1, but then encounters a timeout for either IO1
> or a previous I/O request (e.g., keep-alive or I/O timeout). This triggers
> error recovery, and IO1 is retried on path2, which succeeds.
> 
> 2. After that, the host sends IO2 with the same LBA to path2, which also
> succeeds.
> 
> 3. Meanwhile, IO1 on path1 has not been aborted and continues to execute.
> 
> Ultimately, IO2 gets overwritten by the residual IO1, leading to potential
> data corruption.
> 
> This issue can easily be reproduced in our distributed storage system.
> 
> I noticed that the NVMe Base Specification 2.1, section "9.6 Communication
> Loss Handling," provides a good description of this scenario. It introduces
> the concept of Command Quiesce Time (CQT), which allows for a cleanup period
> for outstanding commands on the controller. Implementing CQT could
> potentially resolve this issue.
> 
> Are there any plans to add CQT support to the NVMe host driver? In the
> absence of this feature, is there any recommended workaround for this issue?
> For instance, could using Device Mapper be a viable substitute?
> 
> Thank you,
> Jiewei
> 



