[nvme_tcp] possible data corruption when host detects a path error and resubmits IOs on another path

Jiewei Ke kejiewei.cn at gmail.com
Mon Dec 30 18:32:47 PST 2024


Hi experts,

When the NVMe host driver detects a command timeout (either admin or I/O
command), it triggers error recovery and attempts to resubmit the I/Os on
another path, provided that native multipath is enabled.

Taking the nvme_tcp driver as an example, the nvme_tcp_timeout function is
called when any command times out:

nvme_tcp_timeout
  -> nvme_tcp_error_recovery
    -> nvme_tcp_error_recovery_work
      -> nvme_tcp_teardown_io_queues
        -> nvme_cancel_tagset
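
For reference, the trigger itself looks roughly like this (paraphrased from
drivers/nvme/host/tcp.c; the exact code varies across kernel versions):

    /* Paraphrased sketch; see drivers/nvme/host/tcp.c for the real code. */
    static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
    {
            /* Move the controller to RESETTING and defer the actual
             * teardown to the asynchronous error-recovery work. */
            if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
                    return;

            dev_warn(ctrl->device, "starting error recovery\n");
            queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
    }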

nvme_cancel_tagset completes the in-flight requests on the failed path and
then calls nvme_failover_req to resubmit them on a different path. There is
no delay before the I/Os are resubmitted, so the controller on the old path
may not yet have cleaned up the commands it still holds, potentially leading
to data corruption on the NVMe namespace.
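
Concretely, the path I am looking at is roughly the following (condensed
from drivers/nvme/host/core.c and multipath.c, so please treat it as
approximate):

    /* Condensed sketch; details differ between kernel versions. */
    bool nvme_cancel_request(struct request *req, void *data)
    {
            /* Skip requests that never started on the failed path. */
            if (!blk_mq_request_started(req))
                    return true;

            nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
            nvme_req(req)->flags |= NVME_REQ_CANCELLED;
            blk_mq_complete_request(req); /* -> nvme_complete_rq() -> failover */
            return true;
    }

    void nvme_failover_req(struct request *req)
    {
            struct nvme_ns *ns = req->q->queuedata;

            /* (ANA handling, bio redirection and locking omitted) */
            blk_steal_bios(&ns->head->requeue_list, req);
            blk_mq_end_request(req, 0);

            /* The requeue work is kicked immediately; nothing waits for
             * the old controller to finish or abort the original command. */
            kblockd_schedule_work(&ns->head->requeue_work);
    }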

For example, consider the following scenario:

1. The host sends IO1 to path1 and then hits a command timeout, either on IO1
itself or on an earlier command (e.g., a keep-alive or another I/O). This
triggers error recovery, and IO1 is retried on path2, where it succeeds.

2. The host then sends IO2, targeting the same LBA, down path2, and it also
succeeds.

3. Meanwhile, IO1 on path1 has not been aborted and continues to execute.

Ultimately, the residual IO1 overwrites the data written by IO2, resulting in
data corruption.

This issue can easily be reproduced in our distributed storage system.

I noticed that the NVMe Base Specification 2.1, section "9.6 Communication
Loss Handling," provides a good description of this scenario. It introduces
the Command Quiesce Time (CQT), which defines a period during which
outstanding commands on the old controller can be cleaned up before the host
retries them elsewhere. Honoring CQT in the host driver could resolve this
issue.
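
To make the idea concrete, here is a purely hypothetical sketch of a
CQT-aware failover; cqt_ms and requeue_dwork do not exist in the current
driver and only stand in for a CQT value obtained from the controller and a
delayed variant of the requeue work:

    /* Hypothetical sketch only; cqt_ms and requeue_dwork are invented. */
    void nvme_failover_req(struct request *req)
    {
            struct nvme_ns *ns = req->q->queuedata;
            unsigned long delay = msecs_to_jiffies(ns->ctrl->cqt_ms);

            blk_steal_bios(&ns->head->requeue_list, req);
            blk_mq_end_request(req, 0);

            /* Defer retrying on another path until the failed controller's
             * Command Quiesce Time has elapsed. */
            queue_delayed_work(nvme_wq, &ns->head->requeue_dwork, delay);
    }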

Are there any plans to add CQT support to the NVMe host driver? In the
absence of this feature, is there any recommended workaround for this issue?
For instance, could using Device Mapper be a viable substitute?

Thank you, 
Jiewei