nvme-tls and TCP window full

Sagi Grimberg sagi at grimberg.me
Thu Jul 6 16:18:15 PDT 2023


> Hi Sagi,

Hey Hannes,

> I'm currently debugging my nvme-tls patches; with the rebase to the
> latest Linus' head things started to behave erratically, as
> occasionally a CQE was never received, triggering a reset.

Bummer.

> Originally I thought my read_sock() implementation was to blame, but
> then I fixed wireshark to do proper frame dissection (wireshark merge
> req #11359) and found that it's rather an issue with the TCP window
> becoming full.
> While this is arguably an issue with the sender (which is trying to
> send 32k worth of data in one packet)

This is before TSO; it is not actually 32K in a single packet. If you
capture on the switch, you will see that it is packetized into MTU-sized
segments properly.

> the connection never recovers from the 
> window full state; the command is dropped, never to reappear again.

This is a bug, and it sounds like a new issue to me. Does this happen
without the TLS patches applied? And without the SPLICE patches from
David?

> I would have thought/hoped that we (from the NVMe side) would
> be able to handle it better; in particular I'm surprised that we can 
> send large chunks of data at all.

What do you mean? If we get an R2T of a given size (within MDTS), we
just send it all... What do you mean by handling it better? What would
you expect the driver to do?
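
Just to be clear about what "send it all" means here, a rough sketch
(simplified placeholder code, not the actual nvme_tcp_try_send_data(),
with partial sends and requeueing omitted) of pushing an R2T'd range
onto the socket would look like:

#include <linux/bvec.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* Sketch only: queue the R2T'd range (already bounded by MDTS) on the
 * socket, one bvec at a time. Partial sends and requeueing omitted. */
static int sketch_send_r2t_data(struct socket *sock, struct bio_vec *bv,
				unsigned int nr_bvecs)
{
	unsigned int i;

	for (i = 0; i < nr_bvecs; i++) {
		struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
		int ret;

		if (i < nr_bvecs - 1)
			msg.msg_flags |= MSG_MORE;	/* more data follows */

		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv[i], 1,
			      bv[i].bv_len);
		ret = sock_sendmsg(sock, &msg);
		if (ret < 0)
			return ret;	/* e.g. -EAGAIN: socket buffer full */
	}
	return 0;
}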

> And that the packet is dropped due to
> a protocol error without us being notified.

TCP needs to handle it. It does not notify the consumer unless a TCP
reset is sent, which breaks the connection. But I am not familiar with
a scenario where a TCP segment is dropped and not retransmitted. TCP is
full of retry logic driven by dropped packets (with RTT measurements)
and ACKs/data arriving out of order...

> So question really is: do we check for the TCP window size somewhere?
> If so, where? Or is it something the lower layers have to do for us?

We don't, and we can't. The application is not, and cannot be, exposed
to the TCP window size because it belongs to the low-level congestion
control machinery buried deep down in TCP; that is why socket buffers
exist. The application sends until the socket buffer is full (and the
buffer is evicted as sent data is acknowledged, i.e. covered by the
ACK'd sequence number).
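
To make that concrete with plain userspace socket code (illustration
only, nothing nvme specific): a non-blocking send() keeps accepting
data until the socket send buffer fills up, at which point it returns
EAGAIN; the peer's TCP window is never visible at this level.

#include <errno.h>
#include <sys/socket.h>

/* Illustration: queue data until the socket send buffer is full and
 * report how much was accepted. The only backpressure the application
 * ever sees is the socket buffer, not the TCP window. */
static ssize_t send_until_sndbuf_full(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = send(fd, buf + off, len - off, MSG_DONTWAIT);

		if (n > 0) {
			off += n;	/* copied into the socket buffer */
			continue;
		}
		if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			break;		/* buffer full, poll for POLLOUT */
		return -1;		/* real error */
	}
	return off;
}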

> Full packet dissection available on request.

This does not sound like an nvme-tcp problem to me; it sounds like a
breakage elsewhere. It is possible that the sendpage conversion missed
a flag or something, or that the stack behaves slightly differently. Or
it could be something in your TLS patches and their interaction with
the SPLICE patches.
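
To illustrate what I mean by "missed a flag" (a simplified sketch under
my assumptions, not the exact driver code before/after the conversion):
with sendpage the "more data is coming" hint was MSG_SENDPAGE_NOTLAST
(plus MSG_MORE), while with a MSG_SPLICE_PAGES sendmsg it is MSG_MORE
alone; losing that hint would make the stack flush small segments.

#include <linux/bvec.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/types.h>
#include <linux/uio.h>

/* Old shape: kernel_sendpage(), NOTLAST hints more data follows. */
static int old_style_send(struct socket *sock, struct page *page,
			  int off, size_t len, bool last)
{
	int flags = MSG_DONTWAIT;

	if (!last)
		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

	return kernel_sendpage(sock, page, off, len, flags);
}

/* New shape: sendmsg with MSG_SPLICE_PAGES; the hint is MSG_MORE only. */
static int new_style_send(struct socket *sock, struct page *page,
			  unsigned int off, unsigned int len, bool last)
{
	struct bio_vec bvec;
	struct msghdr msg = {
		.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
	};

	if (!last)
		msg.msg_flags |= MSG_MORE;

	bvec_set_page(&bvec, page, len, off);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);

	return sock_sendmsg(sock, &msg);
}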

Can you send me your updated code (TLS fixes plus nvme-tls)? I suggest
starting to bisect.


