[PATCH] nvme-tcp: wait socket wmem to drain in queue stop

Randy Jennings randyj at purestorage.com
Tue Apr 8 14:07:24 PDT 2025


On Fri, Apr 4, 2025 at 10:49 PM Michael Liang <mliang at purestorage.com> wrote:
>
> This patch addresses a data corruption issue observed in nvme-tcp during
> testing.
>
> Issue description:
> In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> I/Os are canceled almost immediately after the kernel socket is shut down.
> These canceled I/Os are reported as host path errors, triggering a failover
> that succeeds on a different path.
>
> However, at this point, the original I/O may still be outstanding in the
> host's network transmission path (e.g., the NIC’s TX queue). From the
> user-space app's perspective, the buffer associated with the I/O is considered
> completed, since the I/O is acked on the different path, and may be reused for
> new I/O requests.
>
> Because nvme-tcp enables zero-copy by default in the transmission path,
> this can lead to corrupted data being sent to the original target, ultimately
> causing data corruption.
>
> We can reproduce this data corruption by injecting a delay on one path and
> triggering an I/O timeout.
>
> To prevent this issue, this change ensures that all inflight transmissions are
> fully completed from the host's perspective before returning from queue stop.
> This aligns with the behavior of queue stopping in other NVMe fabric transports.
>
> Reviewed-by: Mohamed Khalfella <mkhalfella at purestorage.com>
> Reviewed-by: Randy Jennings <randyj at purestorage.com>
> Signed-off-by: Michael Liang <mliang at purestorage.com>

Through additional testing, we have recreated the corruption with this
patch applied.  A previous iteration of the patch ran for some time
without the corruption, and we had convinced ourselves internally that a
portion of that version was not needed.  So, unfortunately, it
looks like this patch is not sufficient to prevent the data
corruption.  We still believe the issue lies with zero-copy and
too-quick retransmission (our tests showed that data which was in
the buffer only while userspace controlled the buffer was transmitted on
the wire), but we are still investigating.
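
The buffer-reuse hazard described above can be illustrated with a toy
model.  What follows is a minimal Python sketch, not kernel code and not
the nvme-tcp implementation: ToyTxQueue and simulate are hypothetical
names, and the "delayed path" is simply a list that is read late.  It
only shows why a transmit path that holds a reference to the caller's
buffer can put later contents on the wire, while one that copies at
submit time cannot.

```python
# Toy model (assumption: illustrative only, unrelated to real kernel APIs).
# A "TX queue" either references the caller's buffer (zero-copy) or
# snapshots it at submit time (copy).

class ToyTxQueue:
    def __init__(self, zero_copy):
        self.zero_copy = zero_copy
        self.pending = []  # payloads queued but not yet "on the wire"

    def submit(self, buf):
        # Zero-copy keeps a view of the caller's memory; copy mode snapshots it.
        payload = memoryview(buf) if self.zero_copy else bytes(buf)
        self.pending.append(payload)

    def drain(self):
        # Delayed transmission: the bytes are read only at this point.
        sent = [bytes(p) for p in self.pending]
        self.pending.clear()
        return sent


def simulate(zero_copy):
    buf = bytearray(b"OLD-DATA")
    slow_path = ToyTxQueue(zero_copy)
    slow_path.submit(buf)        # write queued on the delayed path
    # The I/O times out, is retried on another path and completes there,
    # so the application reuses the buffer for a new request.
    buf[:] = b"NEW-DATA"
    return slow_path.drain()[0]  # what the original target finally receives


if __name__ == "__main__":
    print(simulate(zero_copy=True))
    print(simulate(zero_copy=False))
```

In this model, calling drain() before the buffer is handed back is the
analogue of waiting for inflight transmissions to finish before the I/O
is completed.  The sketch says nothing about why that wait alone proved
insufficient in our testing.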

Sincerely,
Randy Jennings


