[nvme_tcp] possible data corruption when host detected path error and resubmit IOs on another path
Engel, Amit
Amit.Engel at Dell.com
Thu Feb 6 08:56:41 PST 2025
Hello,
We are also considering adding CQT support to our NVMe-TCP target solution to deal with the potential data corruption described below.
I can say that it's NOT only a theoretical issue; we were able to reproduce this data corruption.
+1 for Jiewei's question --> "Are there any plans to add CQT support to the NVMe host driver?"
Thanks,
Amit E
-----Original Message-----
From: Linux-nvme <linux-nvme-bounces at lists.infradead.org> On Behalf Of John Meneghini
Sent: Thursday, 6 February 2025 18:13
To: Jiewei Ke <kejiewei.cn at gmail.com>; linux-nvme at lists.infradead.org
Cc: Sagi Grimberg <sagi at grimberg.me>; Hannes Reinecke <hare at suse.de>; Daniel Wagner <dwagner at suse.de>; Christoph Hellwig <hch at lst.de>
Subject: Re: [nvme_tcp] possible data corruption when host detected path error and resubmit IOs on another path
Hi Jiewei.
There have been issues reported on this list[1] where some NVMe/TCP-attached storage arrays experienced data corruption during error-insertion testing. These data corruption issues have been attributed to ghost writes in the storage array caused by commands replayed by the host during Keep Alive Timeout testing. TP4129 was developed to address those issues by introducing a Command Quiescence Time (CQT) timer to the protocol. TP4129 is now part of the NVMe 2.1 specification.
My question is: did you find this problem through some kind of testing or is this something you've seen during code inspection? Are you reporting this because you've experienced data corruption with your NVMe/TCP controller?
If you have a test of some kind that allows easy reproduction of this problem, it would be helpful.
/John
[1] https://lore.kernel.org/linux-nvme/1cea7a75-8397-126c-8a2e-8e08948237b1@grimberg.me/
On 12/30/24 9:32 PM, Jiewei Ke wrote:
> Hi experts,
>
> When the NVMe host driver detects a command timeout (either admin or
> I/O command), it triggers error recovery and attempts to resubmit the
> I/Os on another path, provided that native multipath is enabled.
>
> Taking the nvme_tcp driver as an example, the nvme_tcp_timeout
> function is called when any command times out:
>
> nvme_tcp_timeout
> -> nvme_tcp_error_recovery
> -> nvme_tcp_error_recovery_work
> -> nvme_tcp_teardown_io_queues
> -> nvme_cancel_tagset
>
> nvme_cancel_tagset completes the inflight requests on the failed path
> and then calls nvme_failover_req to resubmit them on a different path.
> There is no wait time before the I/O is resubmitted. This means that
> the controller on the old path may not have fully cleaned up the
> pending requests, potentially leading to data corruption on the NVMe namespace.
>
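To make the ordering explicit, here is a minimal userspace sketch of that recovery step (hypothetical names and types, not the actual driver code): each inflight request is completed locally with a path error and requeued on another path right away, with nothing in between that would let the old controller finish or discard the command.

#include <stdio.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct io_request {
    int id;
    int lba;
    int inflight;   /* still owned by the controller on the failed path? */
};

/* Complete the request locally with a "host path error" status. */
static void cancel_request(struct io_request *rq)
{
    printf("IO%d: completed locally with path error\n", rq->id);
}

/* Resubmit the request on another path; note that nothing waits for the
 * old controller to finish or drop the command first. */
static void failover_request(struct io_request *rq, const char *new_path)
{
    printf("IO%d: requeued on %s immediately\n", rq->id, new_path);
}

static void error_recovery(struct io_request *reqs, int n)
{
    for (int i = 0; i < n; i++) {
        if (!reqs[i].inflight)
            continue;
        cancel_request(&reqs[i]);
        failover_request(&reqs[i], "path2");
        /* The same command may still be executing on path1 here. */
    }
}

int main(void)
{
    struct io_request inflight[] = { { .id = 1, .lba = 100, .inflight = 1 } };
    error_recovery(inflight, 1);
    return 0;
}
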
> For example, consider the following scenario:
>
> 1. The host sends IO1 to path1, but then encounters a timeout for
> either IO1 or a previous I/O request (e.g., keep-alive or I/O
> timeout). This triggers error recovery, and IO1 is retried on path2, which succeeds.
>
> 2. After that, the host sends IO2 with the same LBA to path2, which
> also succeeds.
>
> 3. Meanwhile, IO1 on path1 has not been aborted and continues to execute.
>
> Ultimately, IO2 gets overwritten by the residual IO1, leading to
> potential data corruption.
>
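A minimal simulation of that timeline, assuming a single LBA and applying the writes in the order the backend actually executes them, shows how the residual IO1 ends up as the final contents (illustrative only, not a reproducer for real hardware):

#include <stdio.h>

struct write_op {
    const char *desc;
    int lba;
    int value;
};

int main(void)
{
    int media[8] = { 0 };

    /* Order in which the writes actually reach the media in the scenario: */
    struct write_op timeline[] = {
        { "IO1 retried on path2",                5, 1 },
        { "IO2 on path2 (newer data)",           5, 2 },
        { "residual IO1 still running on path1", 5, 1 },
    };

    for (int i = 0; i < 3; i++) {
        media[timeline[i].lba] = timeline[i].value;
        printf("%-38s -> LBA %d = %d\n",
               timeline[i].desc, timeline[i].lba, media[timeline[i].lba]);
    }

    /* The host saw IO2 complete after IO1's retry, yet LBA 5 holds IO1's data. */
    printf("final LBA 5 contents: %d (host expects %d)\n", media[5], 2);
    return 0;
}

From the host's point of view both IO1 (retried) and IO2 completed successfully on path2, yet the namespace ends up holding IO1's data.
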
> This issue can easily be reproduced in our distributed storage system.
>
> I noticed that the NVMe Base Specification 2.1, section "9.6
> Communication Loss Handling," provides a good description of this
> scenario. It introduces the concept of Command Quiescence Time (CQT),
> which allows for a cleanup period for outstanding commands on the
> controller. Implementing CQT could potentially resolve this issue.
>
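A host-side CQT implementation would, roughly speaking, have to delay the requeue of cancelled commands until the controller's CQT has elapsed since the connection was declared lost, so that any residual commands on the old controller have either completed or been discarded. A minimal sketch of that idea (hypothetical names and a made-up CQT value, not a proposed kernel patch):

#include <stdio.h>
#include <time.h>

/* CQT reported by the controller, in milliseconds (made-up value). */
#define CQT_MS 2000

static void requeue_on_alternate_path(int io_id)
{
    printf("IO%d: resubmitted on path2 after quiescence\n", io_id);
}

static void failover_with_cqt(int io_id)
{
    struct timespec ts = { .tv_sec = CQT_MS / 1000,
                           .tv_nsec = (CQT_MS % 1000) * 1000000L };

    printf("IO%d: cancelled on path1, waiting %d ms CQT\n", io_id, CQT_MS);
    nanosleep(&ts, NULL);   /* wait out the quiescence window first */
    requeue_on_alternate_path(io_id);
}

int main(void)
{
    failover_with_cqt(1);
    return 0;
}
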
> Are there any plans to add CQT support to the NVMe host driver? In the
> absence of this feature, is there any recommended workaround for this issue?
> For instance, could using Device Mapper be a viable substitute?
>
> Thank you,
> Jiewei
>