[PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
Boris Pismenny
borispismenny at gmail.com
Thu Dec 17 14:06:35 EST 2020
On 15/12/2020 7:19, David Ahern wrote:
> On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>>> as zerocopy for the following reasons:
>>>> (1) The former places buffers *exactly* where the user requests
>>>> regardless of the order of response arrivals, while the latter places packets
>>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>>> can be implemented using data placement, but not vice versa.
>>>
>>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>>> starting point where seq N == sgl element 0, position 0. Presumably
>>> there is a hardware cursor to track where you are in filling the SGL as
>>> packets are processed. You abort on OOO, so it seems like a fairly
>>> straightforward problem.
>>>
>>
>> We do not abort on OOO. Moreover, we can keep going as long as
>> PDU headers are not reordered.
>
> Meaning packets received OOO which contain all or part of a PDU header
> are aborted, but pure data packets can arrive out-of-order?
>
> Did you measure the effect of OOO packets? e.g., randomly drop 1 in 1000
> nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
> benchmarks?
>
Yes for TLS, where similar ideas are used, but not yet for NVMe-TCP.
In the worst case we measured (5% OOO), we got the same performance as
pure software TLS under those conditions. We will strive for the same
with nvme-tcp, but we will only be able to test it once we have
hardware. For now, we are using a mix of emulation and simulation to
test and benchmark.
>>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>>> allows other use cases to build on this as opposed to the checks you
>>> want for a special case. For example, a split at the protocol headers /
>>> payload boundaries would be a generic solution where kernel managed
>>> protocols get data in one buffer and socket data is put into a given
>>> SGL. I am guessing that you have to be already doing this to put PDU
>>> payloads into an SGL and other headers into other memory to make a
>>> complete packet, so this is not too far off from what you are already doing.
>>>
>>
>> Splitting at protocol header boundaries and placing data at socket defined
>> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
>> responses. Here is an example:
>>
>> the host submits the following requests:
>> +--------+--------+--------+
>> | Read 1 | Read 2 | Read 3 |
>> +--------+--------+--------+
>>
>> the target responds with the following responses:
>> +--------+--------+--------+
>> | Resp 2 | Resp 3 | Resp 1 |
>> +--------+--------+--------+
>
> Does 'Resp N' == 'PDU + data' like this:
>
> +---------+--------+---------+--------+---------+--------+
> | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
> +---------+--------+---------+--------+---------+--------+
>
> or is it 1 PDU hdr + all of the responses?
>
Yes, 'Resp N' = PDU header + PDU data, segmented by TCP however it
chooses to do so. The PDU header's command_id field correlates the
request with the response. We use that correlation in hardware to
identify the buffers into which the data needs to be scattered. In other
words, hardware holds a mapping from command_id to the block layer
buffer SGL.
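To make that correlation concrete, here is a minimal sketch of the state
the offload conceptually keeps per outstanding request. The names below
(ddp_io_map, ddp_find_io) are illustrative only and are not the actual
driver structures:

        #include <linux/types.h>
        #include <linux/scatterlist.h>

        /* Illustrative sketch: the offload is armed per command_id with the
         * destination SGL before a response can arrive, so data is scattered
         * directly no matter in which order the target sends responses.
         */
        struct ddp_io_map {
                u16                     command_id;     /* command_id from the NVMe-TCP PDU header */
                struct scatterlist      *sgl;           /* block layer destination buffers */
                unsigned int            nents;          /* number of SGL entries */
        };

        /* Conceptually, when a response PDU header carrying command_id is
         * parsed, the matching entry is looked up and the PDU data is
         * DMA-written into sgl, independent of response ordering.
         */
        static struct ddp_io_map *ddp_find_io(struct ddp_io_map *table,
                                              unsigned int nr, u16 command_id)
        {
                unsigned int i;

                for (i = 0; i < nr; i++)
                        if (table[i].command_id == command_id)
                                return &table[i];
                return NULL;
        }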
>>
>> I think that the interface we created (tcp_ddp) is sufficiently generic
>> for the task at hand, which is offloading protocols that can re-order
>> their responses, a non-trivial task that we claim is important.
>>
>> We designed it to support other protocols and not just nvme-tcp,
>> which is merely an example. For instance, I think that supporting iSCSI
>> would be natural, and that other protocols will fit nicely.
>
> It would be good to add documentation that describes the design, its
> assumptions and its limitations. tls has several under
> Documentation/networking. e.g., one important limitation to note is that
> this design only works for TCP sockets owned by kernel modules.
>
You are right; I'll add such documentation for tcp_ddp. It indeed works
only for kernel TCP sockets, but future work may extend it.
>>
>>> ###
>>>
>>> A dump of other comments about this patch set:
>>
>> Thanks for reviewing! We will fix and resubmit.
>
> Another one I noticed today. You have several typecasts like this:
>
> cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
>
> since cqe is a member of cqe128, container_of should be used instead of
> the magic '- 64'.
>
Will fix.
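For reference, a sketch of the intended change, assuming struct
mlx5e_cqe128 embeds the 64-byte CQE as a member (called cqe64 below; the
member name is an assumption for illustration):

        struct mlx5e_cqe128 *cqe128;

        /* cqe points at the embedded 64-byte CQE; container_of() recovers
         * the enclosing 128-byte CQE without the hard-coded '- 64'
         * pointer arithmetic.
         */
        cqe128 = container_of(cqe, struct mlx5e_cqe128, cqe64);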