[PATCH] nvme-tcp: wait socket wmem to drain in queue stop

Sagi Grimberg sagi at grimberg.me
Sun Apr 13 15:25:05 PDT 2025



On 05/04/2025 8:48, Michael Liang wrote:
> This patch addresses a data corruption issue observed in nvme-tcp during
> testing.
>
> Issue description:
> In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> I/Os are canceled almost immediately after the kernel socket is shut down.
> These canceled I/Os are reported as host path errors, triggering a failover
> that succeeds on a different path.
>
> However, at this point, the original I/O may still be outstanding in the
> host's network transmission path (e.g., the NIC’s TX queue). From the
> user-space application's perspective, the buffer associated with the I/O is
> considered free once the I/O is acknowledged on the other path, and it may be
> reused for new I/O requests.
>
> Because nvme-tcp enables zero-copy by default in the transmission path,
> this can lead to corrupted data being sent to the original target, ultimately
> causing data corruption.

This is unexpected.

1. Before retrying the command, the host shuts down the socket.
2. The host sets sk_lingertime to 0, which means that as soon as the
socket is shut down, no further packets should be transmitted on it,
zero-copy or not.

Perhaps something is not handled correctly with linger=0? Perhaps you
should try with linger=<some-timeout>?
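
For reference, here is a minimal userspace sketch of the linger=0 semantics
in question (the target address and port are placeholders; nvme-tcp sets up
the equivalent in-kernel state via sock_no_linger() before
kernel_sock_shutdown(), this is only an analogue):

/*
 * Userspace sketch of linger=0 behavior on a TCP socket.
 * Illustration only; not nvme-tcp code.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(4420),	/* conventional NVMe/TCP port */
	};
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };
	char buf[64 * 1024];
	int fd;

	memset(buf, 0, sizeof(buf));
	inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);	/* example target */

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}

	/* l_onoff=1, l_linger=0: close() discards any data still queued in
	 * the send buffer and resets the connection instead of draining it. */
	setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));

	send(fd, buf, sizeof(buf), MSG_DONTWAIT);	/* may stay queued */
	close(fd);	/* connection aborted, queued data dropped */
	return 0;
}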
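
And for completeness, the buffer-ownership rule the quoted description points
at: pages handed to the stack for zero-copy transmission must not be reused
until the stack releases them. A userspace sketch using MSG_ZEROCOPY (not the
mechanism nvme-tcp itself uses; the in-kernel send path hands request pages
to the stack directly, but the ownership rule is the same; address and port
are placeholders):

/*
 * Userspace sketch of the zero-copy buffer-ownership rule behind the
 * reported corruption. Illustration only; not the nvme-tcp send path.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY	60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY	0x4000000
#endif

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(4420),
	};
	static char buf[256 * 1024];
	char ctrl[512];
	struct msghdr msg = {
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	int one = 1, fd;

	inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);	/* example target */
	memset(buf, 'A', sizeof(buf));

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}
	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

	/* The pages backing buf remain referenced by the network stack
	 * after this call; TCP may still (re)transmit from them. */
	send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

	/*
	 * Reusing buf here, e.g. for a new request after the I/O was
	 * "completed" on another path, would let those (re)transmissions
	 * carry the new contents: the corruption scenario above.
	 *
	 * Correct usage waits for the zero-copy completion notification
	 * on the socket error queue before touching buf again (minimal,
	 * busy-waiting form for the sketch):
	 */
	while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 && errno == EAGAIN)
		;

	close(fd);
	return 0;
}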


