[PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload

Thu Dec 10 21:43:57 EST 2020

On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
>> Yes, TCP is a byte stream, so the packets could very well show up like this:
>>
>>  +--------------+---------+-----------+---------+--------+-----+
>>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>>  +--------------+---------+-----------+---------+--------+-----+
>>  +-----------------------------------+---------+--------+-----+
>>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>>  +-----------------------------------+---------+--------+-----+
>>  +-------- +-------------------------+---------+--------+-----+
>>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>>  +---------+-------------------------+---------+--------+-----+
>>
>> If your hardware can extract the NVMe payload into a targeted SGL like
>> you want in this set, then it has some logic for parsing headers and
>> "snapping" an SGL to a new element. ie., it already knows 'prev data'
>> goes with the in-progress PDU, sees more data, recognizes a new PDU
>> header and a new payload. That means it already has to handle a
>> 'snap-to-PDU' style argument where the end of the payload closes out an
>> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
>> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
>> this case, you want 'snap-to-PDU' but that could just as easily be 'no
>> snap at all', just a byte stream and filling an SGL after the protocol
>> headers.
> 
> This 'snap-to-PDU' requirement is something that I don't understand
> with the current TCP zero copy. In case of, say, a storage application

current TCP zero-copy does not handle this and it can't AFAIK. I believe
it requires hardware level support where an Rx queue is dedicated to a
flow / socket and some degree of header and payload splitting (header is
consumed by the kernel stack and payload goes to socket owner's memory).

> which wants to send some headers (whatever RPC info, block number,
> etc.) and then a 4k block of data - how does the RX side get just the
> 4k block a into a page so it can zero copy it out to its storage device?
> 
> Per-connection state in the NIC, and FW parsing headers is one way,
> but I wonder how this record split problem is best resolved generically.
> Perhaps by passing hints in the headers somehow?
> 
> Sorry for the slight off-topic :)
> 
Hardware has to be parsing the incoming packets to find the usual
ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
have some kind of ULP processor to know how to parse the TCP byte stream
at least well enough to find the PDU header and interpret it to get pdu
header length and payload length.

At that point you push the protocol headers (eth/ip/tcp) into one buffer
for the kernel stack protocols and put the payload into another. The
former would be some page owned by the OS and the latter owned by the
process / socket (generically, in this case it is a kernel level
socket). In addition, since the payload is spread across multiple
packets the hardware has to keep track of TCP sequence number and its
current place in the SGL where it is writing the payload to keep the
bytes contiguous and detect out-of-order.

If the ULP processor knows about PDU headers it knows when enough
payload has been found to satisfy that PDU in which case it can tell the
cursor to move on to the next SGL element (or separate SGL). That's what
I meant by 'snap-to-PDU'.

Alternatively, if it is any random application with a byte stream not
understood by hardware, the cursor just keeps moving along the SGL
elements assigned it for this particular flow.

If you have a socket whose payload is getting offloaded to its own queue
(which this set is effectively doing), you can create the queue with
some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
that controls the parsing when you stop writing to one SGL element and
move on to the next. Again, assuming hardware support for such attributes.

I don't work for Nvidia, so this is all supposition based on what the
patches are doing.