nvme-tls and TCP window full

Sagi Grimberg sagi at grimberg.me
Tue Jul 11 02:28:52 PDT 2023


Hey Hannes,

Any progress on this one?

On 7/7/23 02:18, Sagi Grimberg wrote:
> 
>> Hi Sagi,
> 
> Hey Hannes,
> 
>> I'm currently debugging my nvme-tls patches; with the rebase onto the 
>> latest Linus tree things started to behave erratically, as occasionally 
>> a CQE was never received, triggering a reset.
> 
> Bummer.
> 
>> Originally I thought my read_sock() implementation was to blame, but 
>> then I fixed wireshark to do proper frame dissection (wireshark merge 
>> req #11359), and found that it's rather an issue with the TCP window 
>> becoming full.
>> While this is arguably an issue with the sender (which is trying to 
>> send 32k worth of data in one packet)
> 
> This is before TSO; it is not actually 32K in a single packet. If you
> record on the switch, you will see that it is properly packetized into
> MTU-sized segments.
> 
>> the connection never recovers from the window full state; the command 
>> is dropped, never to reappear again.
> 
> This is a bug, and it sounds like a new issue to me. Does this happen
> without the TLS patches applied? And without the SPLICE patches from
> David?
> 
>> I would have thought/hoped that we (from the NVMe side) would
>> be able to handle it better; in particular I'm surprised that we can 
>> send large chunks of data at all.
> 
> What do you mean? If we get an R2T of a given size (within MDTS), we
> just send it all... What do you mean by handling it better? What would
> you expect the driver to do?
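> 
> In sketch form (field names are approximate, not the literal driver
> code):
> 
>     /* R2T solicits one contiguous range; queue all of it for sending */
>     req->h2c_offset = le32_to_cpu(r2t->r2t_offset);
>     req->h2c_length = le32_to_cpu(r2t->r2t_length);
>     /* the send work then streams the whole range out, stopping only
>      * when the socket buffer pushes back */
> 
> There is no extra chunking on the nvme-tcp side beyond MDTS.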
> 
>> And that the packet is dropped due to a protocol error without us 
>> being notified.
> 
> TCP needs to handle it. It does not notify the consumer unless a TCP
> reset is sent, which breaks the connection. But I am not familiar with
> a scenario where a TCP segment is dropped and not retransmitted. TCP is
> full of retry logic based on dropped packets (with RTT measurements)
> and acks/data arriving out of order...
> 
>> So the question really is: do we check for the TCP window size
>> somewhere? If so, where? Or is it something the lower layers have to
>> do for us?
> 
> We don't and we can't. The application is not and cannot be exposed to
> the TCP window size, because it belongs to the congestion control
> machinery buried deep down in TCP; that's why socket buffers exist. The
> application sends until the socket buffer is full (and buffers are
> evicted once the sent data is acknowledged, i.e. covered by the ack'd
> sequence number).
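> 
> Conceptually the send side looks like this (a simplified sketch, not
> the actual driver code):
> 
>     /* push data until the socket buffer is full */
>     while (len) {
>             ret = sock_sendmsg(queue->sock, &msg); /* MSG_DONTWAIT set */
>             if (ret == -EAGAIN)
>                     break;  /* sk_sndbuf is full; the sk_write_space()
>                              * callback re-arms the send work later */
>             if (ret < 0)
>                     return ret;
>             len -= ret;
>     }
> 
> The TCP window never shows up at this level; all the driver ever
> observes is the socket buffer filling up and draining again.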
> 
>> Full packet dissection available on request.
> 
> This does not sound like an nvme-tcp problem to me. Sounds like
> a breakage. It is possible that the sendpage conversion missed
> a flag or something, or that the stack behaves slightly differently.
> Or something in your TLS patches and their interaction with the SPLICE
> patches.
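> 
> For reference, the old path was (abridged, flag set is hypothetical):
> 
>     kernel_sendpage(queue->sock, page, offset, len,
>                     MSG_DONTWAIT | (last ? MSG_EOR : MSG_MORE));
> 
> while after the conversion it should be roughly:
> 
>     struct bio_vec bvec;
>     struct msghdr msg = {
>             .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
>     };
> 
>     if (!last)
>             msg.msg_flags |= MSG_MORE;
>     bvec_set_page(&bvec, page, len, offset);
>     iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
>     ret = sock_sendmsg(queue->sock, &msg);
> 
> If MSG_SPLICE_PAGES or MSG_MORE got lost somewhere in that conversion,
> the on-the-wire behaviour would change.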
> 
> Can you send me your updated code (tls fixes plus nvme-tls)? I suggest
> we start bisecting.


