[nvme_tcp] possible data corruption when host detected path error and resubmit IOs on another path
Jiewei Ke
kejiewei.cn at gmail.com
Thu Feb 6 21:23:18 PST 2025
Hi John,
Thanks for the info. This issue can be reproduced in the following way:
1. The NVMe Host connects to two paths with a keep-alive time of 20 seconds. The I/O policy is set to
numa, so only one path is used to send I/O at a time. We first use fio to generate I/O and confirm
which path the Host uses to send it; that path is referred to as Path 1 below (a rough command sketch
follows the list-subsys output).
root at localhost:~# cat /etc/fedora-release
Fedora release 40 (Forty)
root at localhost:~# uname -r
6.8.5-301.fc40.x86_64
root at localhost:~# nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2020-12.com.smartx:system:linux-nvmf-subsys1
hostnqn=nqn.2014-08.org.nvmexpress:uuid:fef62b42-0c64-8019-065b-edf0617f74f8
iopolicy=numa
\
+- nvme0 tcp traddr=10.10.130.83,trsvcid=8009,src_addr=10.10.130.206 live optimized
+- nvme1 tcp traddr=10.10.130.84,trsvcid=8009,src_addr=10.10.130.206 live optimized
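For reference, this setup can probably be reproduced with standard tools roughly as follows. This is
only a sketch, not the exact commands we ran; the subsystem NQN, addresses and keep-alive value are
taken from the list-subsys output above, and which path actually carries the fio I/O is easiest to
confirm on the Target side (e.g. by watching traffic on each portal).
root at localhost:~# nvme connect -t tcp -a 10.10.130.83 -s 8009 -k 20 \
>   -n nqn.2020-12.com.smartx:system:linux-nvmf-subsys1
root at localhost:~# nvme connect -t tcp -a 10.10.130.84 -s 8009 -k 20 \
>   -n nqn.2020-12.com.smartx:system:linux-nvmf-subsys1
root at localhost:~# echo numa > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
root at localhost:~# fio --name=probe --filename=/dev/nvme0n1 --rw=randread --bs=4k \
>   --direct=1 --time_based --runtime=10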
2. Inject a 40-second I/O delay at the Target side of Path 1, exceeding the default NVMe I/O timeout
of 30 seconds. This delay does not affect the keep-alive. We use our own error injection tool to add
this I/O delay.
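If the Target is a Linux nvmet target backed by a plain block device, a comparable whole-device delay
can likely be injected with dm-delay instead, with the nvmet namespace pointed at the delayed mapping
rather than the raw device. A sketch, with /dev/sdb and the map name as placeholders:
[root at node130-83 ~]$ SIZE=$(blockdev --getsz /dev/sdb)   # backing device size in 512-byte sectors
[root at node130-83 ~]$ dmsetup create delayed-ns1 \
> --table "0 $SIZE delay /dev/sdb 0 40000"                # delays every read and write by 40000 ms = 40 s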
3. The Host uses dd to send two write I/Os and one read I/O to the same offset.
root at localhost:~# echo 234 | dd of=/dev/nvme0n1 oflag=direct bs=512 count=1; \
> echo 567 | dd of=/dev/nvme0n1 oflag=direct bs=512 count=1; \
> dd if=/dev/nvme0n1 bs=4 count=1 2>/dev/null | hexdump -C
0+1 records in
0+1 records out
4 bytes copied, 30.3181 s, 0.0 kB/s <<<< I/O 1 (data: 234) is delayed, times out, and then
completes on the other path successfully
0+1 records in
0+1 records out
4 bytes copied, 0.000483626 s, 8.3 kB/s <<<< I/O 2 (data: 567) completes successfully
00000000 35 36 37 0a |567.| <<<< data is 567 now
00000004
4. 28 seconds after the Host sends I/O 1, the Target side of Path 1 shuts down the NIC.
[root at node130-83 10:55:34 ~]$ sleep 28; ifconfig port-access down
5. Wait until the Host completes step 3. A while later, the Host reads the same offset again but gets
234. Data corruption has happened.
root at localhost:~# dd if=/dev/nvme0n1 bs=4 count=1 2>/dev/null | hexdump -C
00000000 32 33 34 0a |234.| <<<< data rolls back to 234
00000004
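To double-check that I/O 1 really fails over to the other path during step 3, it should be enough to
watch the kernel log and the path states from another terminal while the dd commands run (no exact
output is quoted here since it varies):
root at localhost:~# dmesg -w | grep nvme            # shows the I/O timeout and error recovery on the Path 1 controller
root at localhost:~# nvme list-subsys /dev/nvme0n1   # the Path 1 controller temporarily leaves the live state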
Here is the explanation:
1. At 0s, the Host sends I/O 1 (data: 234). I/O 1 reaches the Target side of Path 1 and is delayed.
2. At 28s, the network card at the Target side of Path 1 is shut down.
3. At 30s, the Host detects that I/O 1 has timed out, triggering error recovery. At this point the
Host may close the connection to Path 1. But since the NIC at the Target side of Path 1 is down, the
Target cannot detect the disconnection and cannot abort the I/O (it must wait for the Keep-Alive
timeout). The Host retries I/O 1 on the other path and completes it, then sends I/O 2 (data: 567) and
completes it. The data is 567 now.
4. At 40s, the 40-second delay expires and the Target side of Path 1 executes the stale I/O 1. The
data is rolled back from 567 to 234: data corruption.
5. At 48s, the Target side of Path 1 detects the Keep-Alive timeout and disconnects the connection.
One might ask whether adjusting the I/O timeout on the Target side (e.g., setting it to less than
NVMe's default 30-second I/O timeout) could prevent this issue by letting the Target abort I/O 1
before the Host detects the I/O timeout. However, if the network is poor, the Target may only receive
the I/O after a long delay, so its timer starts too late and the I/O still cannot be aborted before
the Host retries it on the other path.
I think the patchset in [1] cannot completely fix this issue. To fix it, after the Host finds that
I/O 1 has timed out and closes the connection, it should wait 2 * KATO to let the Target detect the
connection loss, plus an extra CQT to let the Target abort the in-flight I/O; only then is the Host
free to retry the I/O on the other path.
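With the numbers from this reproduction, that means holding off the retry for at least 2 * 20s = 40s
plus the CQT after the Host closes the connection. For reference, the keep-alive value a controller is
actually using can be read back via the Keep Alive Timer feature (FID 0x0f); the CQT would come from
the Identify Controller data once the Target implements TP4129 / NVMe 2.1:
root at localhost:~# nvme get-feature /dev/nvme0 -f 0x0f -H   # Keep Alive Timer, reported in milliseconds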
Feel free to let me know if you need more info.
Thanks,
Jiewei
[1] https://lore.kernel.org/linux-nvme/1cea7a75-8397-126c-8a2e-8e08948237b1@grimberg.me/
> On Feb 7, 2025, at 00:13, John Meneghini <jmeneghi at redhat.com> wrote:
>
> Hi Jiewei.
>
> There have been issues reported on this list[1] where some NVMe/TCP attached storage arrays are
> experiencing data corruption during error insertion testing. These data corruption issues have
> been attributed to ghost writes in the storage array following commands replayed on the host
> during Keep Alive Timeout testing. TP4129 was developed to address those issues by introducing
> a Command Quiescence Time (CQT) timer to the protocol. TP4129 is now a part of the NVMe 2.1
> specification.
>
> My question is: did you find this problem through some kind of testing or is this something
> you've seen during code inspection? Are you reporting this because you've experienced data
> corruption with your NVMe/TCP controller?
>
> If you have a test of some kind that allows an easy reproduction of this problem it would
> be helpful.
>
> /John
>
> [1] https://lore.kernel.org/linux-nvme/1cea7a75-8397-126c-8a2e-8e08948237b1@grimberg.me/