nvme-tls and TCP window full
Sagi Grimberg
sagi at grimberg.me
Tue Jul 11 02:28:52 PDT 2023
Hey Hannes,
Any progress on this one?
On 7/7/23 02:18, Sagi Grimberg wrote:
>
>> Hi Sagi,
>
> Hey Hannes,
>
>> I'm currently debugging my nvme-tls patches; after rebasing onto
>> Linus' latest head things started behaving erratically, as occasionally
>> a CQE was never received, triggering a reset.
>
> Bummer.
>
>> Originally I thought my read_sock() implementation was to blame, but
>> then I fixed wireshark to do proper frame dissection (wireshark merge
>> req #11359) and found that it's rather an issue with the TCP window
>> becoming full.
>> While this is arguably an issue with the sender (which is trying to
>> send 32k worth of data in one packet)
>
> This is before TSO; it is not actually 32K in a single packet. If you
> capture on the switch, you will see that it is properly segmented into
> MTU-sized packets.
>
>> the connection never recovers from the window full state; the command
>> is dropped, never to reappear again.
>
> This is a bug, and it sounds like a new issue to me. Does this happen
> without the TLS patches applied? And without the SPLICE patches from
> David?
>
>> I would have thought/hoped that we (from the NVMe side) would
>> be able to handle it better; in particular I'm surprised that we can
>> send large chunks of data at all.
>
> What do you mean? If we get an R2T of a given size (within MDTS), we
> just send it all... What do you mean by handling it better? What would
> you expect the driver to do?
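> 
> As a rough userspace sketch of what "we just send it all" means (this
> is not the actual nvme-tcp driver code, and send_r2t_data() is a
> hypothetical name): upon an R2T for r2t_length bytes the host simply
> streams the whole range and lets the socket provide the flow control:
> 
> #include <errno.h>
> #include <stddef.h>
> #include <sys/socket.h>
> #include <sys/types.h>
> 
> /* Stream an entire R2T-sized range; TCP flow control, not the
>  * application, decides how fast it actually goes out on the wire. */
> static int send_r2t_data(int sock, const char *buf, size_t r2t_length)
> {
>         size_t sent = 0;
> 
>         while (sent < r2t_length) {
>                 ssize_t ret = send(sock, buf + sent, r2t_length - sent,
>                                    MSG_NOSIGNAL);
>                 if (ret < 0) {
>                         if (errno == EINTR)
>                                 continue;
>                         return -errno;  /* real socket error */
>                 }
>                 sent += ret;
>         }
>         return 0;
> }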
>
>> And that the packet is dropped due to a protocol error without us
>> being notified.
>
> TCP needs to handle it. It does not notify the consumer unless a TCP
> reset is sent, which breaks the connection. But I am not familiar with
> a scenario where a TCP segment is dropped and never retransmitted. TCP
> is full of retransmission logic for dropped packets (with RTT
> measurements) and acks/data arriving out of order...
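> 
> If in doubt, it is easy to confirm whether TCP is still retransmitting
> on that connection. A small debugging sketch (dump_tcp_state() is a
> hypothetical helper; "ss -ti" on the live socket shows the same
> counters):
> 
> #include <netinet/in.h>
> #include <netinet/tcp.h>
> #include <stdio.h>
> #include <sys/socket.h>
> 
> /* Dump a few TCP_INFO counters: if tcpi_retrans/tcpi_total_retrans
>  * keep growing the stack is retrying the lost segment; if they stay
>  * flat, nothing was lost as far as TCP is concerned. */
> static void dump_tcp_state(int sock)
> {
>         struct tcp_info ti;
>         socklen_t len = sizeof(ti);
> 
>         if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
>                 printf("rtt=%uus cwnd=%u unacked=%u lost=%u retrans=%u total=%u\n",
>                        ti.tcpi_rtt, ti.tcpi_snd_cwnd, ti.tcpi_unacked,
>                        ti.tcpi_lost, ti.tcpi_retrans, ti.tcpi_total_retrans);
> }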
>
>> So the question really is: do we check for the TCP window size
>> somewhere? If so, where? Or is it something the lower layers have to
>> do for us?
>
> We don't and we can't. The application is not and cannot be exposed to
> the TCP window size because it belongs to the very low-level congestion
> control mechanisms buried deep down in TCP; that's why socket buffers
> exist. The application sends until the socket buffer is full (and the
> buffer is evicted when sent data is acknowledged within the ack'd
> sequence number).
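> 
> In userspace terms (a minimal sketch, not how the driver is structured;
> send_or_wait() is a hypothetical name): "window full" only ever
> surfaces to the sender as the socket buffer filling up, i.e. send()
> blocking or returning EAGAIN on a non-blocking socket, after which you
> simply wait for writability:
> 
> #include <errno.h>
> #include <poll.h>
> #include <stddef.h>
> #include <sys/socket.h>
> #include <sys/types.h>
> 
> static ssize_t send_or_wait(int sock, const void *buf, size_t len)
> {
>         for (;;) {
>                 ssize_t ret = send(sock, buf, len,
>                                    MSG_DONTWAIT | MSG_NOSIGNAL);
>                 if (ret >= 0 || errno != EAGAIN)
>                         return ret < 0 ? -errno : ret;
> 
>                 /* SO_SNDBUF is exhausted (in-flight data not yet
>                  * acked, possibly because the peer window closed):
>                  * wait until TCP frees up room, then retry. */
>                 struct pollfd pfd = { .fd = sock, .events = POLLOUT };
>                 poll(&pfd, 1, -1);
>         }
> }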
>
>> Full packet dissection available on request.
>
> This does not sound like an nvme-tcp problem to me. Sounds like
> a breakage. It is possible that the sendpage conversion missed
> a flag or something, or that the stack behaves slightly differently. Or
> it's something in your TLS patches and their interaction with the
> SPLICE patches.
>
> Can you send me your updated code (tls fixes plus nvme-tls)? I suggest
> starting to bisect.