[PATCH v12 07/26] nvme-tcp: Add DDP offload control path

Chaitanya Kulkarni chaitanyak at nvidia.com
Thu Aug 10 22:28:24 PDT 2023


On 8/9/2023 12:39 AM, Sagi Grimberg wrote:
> 
> 
> On 8/1/23 05:25, Chaitanya Kulkarni wrote:
>> On 7/12/23 09:14, Aurelien Aptel wrote:
>>> From: Boris Pismenny <borisp at nvidia.com>
>>>
>>> This commit introduces direct data placement (DDP) offload for
>>> NVMe/TCP. There is a context per queue, which is established after
>>> the handshake using the sk_add/del NDOs.
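
Roughly, once the handshake has completed the queue's socket is handed
over to the netdev. A minimal sketch of that direction (ulp_ddp_sk_add(),
the ddp_netdev field and the config layout are illustrative, not the
exact API of this series):

static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
{
	/* describe the NVMe/TCP session so the HW can parse its PDUs */
	struct ulp_ddp_config config = {
		.type			= ULP_DDP_NVME,
		.nvmeotcp.pfv		= NVME_TCP_PFV_1_0,
		.nvmeotcp.dgst		= queue->data_digest ?
					  NVME_TCP_DATA_DIGEST_ENABLE : 0,
		.nvmeotcp.queue_size	= queue->ctrl->ctrl.sqsize + 1,
	};

	/* hand the socket to the NIC; undone by the matching sk_del
	 * call when the queue is torn down */
	return ulp_ddp_sk_add(queue->ctrl->ddp_netdev, queue->sock->sk,
			      &config);
}
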
>>>
>>> Additionally, a resynchronization routine is used to assist the
>>> hardware in recovering from TCP out-of-order (OOO) reception and to
>>> continue the offload. Resynchronization operates as follows:
>>>
>>> 1. TCP OOO causes the NIC HW to stop the offload
>>>
>>> 2. NIC HW identifies a PDU header at some TCP sequence number,
>>> and asks NVMe-TCP to confirm it.
>>> This request is delivered from the NIC driver to NVMe-TCP by first
>>> finding the socket for the packet that triggered the request, and
>>> then finding the nvme_tcp_queue that uses this socket.
>>> Finally, the request is recorded in the nvme_tcp_queue.
>>>
>>> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
>>> it with the PDU header TCP sequence, and report the result to the
>>> NIC driver (resync), which will update the HW, and resume offload
>>> when all is successful.
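
To make steps 2 and 3 concrete, a sketch of the ULP side could look like
the following; the helper and field names (resync_req, ulp_ddp_resync(),
ddp_netdev, ...) are illustrative rather than the exact symbols of this
series:

/*
 * Step 2 (illustrative): the NIC driver found a candidate PDU header at
 * TCP sequence 'seq' and asks the ULP to confirm it.  Only record the
 * request here; the low bit serves as a "pending" flag.
 */
static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
{
	struct nvme_tcp_queue *queue = sk->sk_user_data;

	atomic64_set(&queue->resync_req, ((u64)seq << 32) | 1);
	return true;
}

/*
 * Step 3 (illustrative): called from the regular PDU receive path when a
 * PDU header is parsed at TCP sequence 'pdu_seq'.  If it matches the
 * recorded request, report back to the NIC driver so it can re-arm the
 * offload.
 */
static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
				     u32 pdu_seq)
{
	u64 req = atomic64_read(&queue->resync_req);

	if (!(req & 1))				/* nothing pending */
		return;
	if ((u32)(req >> 32) != pdu_seq)	/* not the requested seq */
		return;

	/* clear the pending request unless it was re-armed meanwhile */
	if (atomic64_cmpxchg(&queue->resync_req, req, 0) == req)
		ulp_ddp_resync(queue->ctrl->ddp_netdev, queue->sock->sk,
			       pdu_seq);
}

The encoding with the low bit as a pending flag is just one possibility;
the point is that step 2 only records the request and step 3 answers it
from the normal receive path.
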
>>>
>>> Some HW implementations, such as ConnectX-7, assume a linear CCID
>>> (0...N-1 for a queue of size N), whereas the Linux NVMe driver uses
>>> part of the 16-bit CCID as a generation counter. To address that, we
>>> use the existing quirk in the NVMe layer when the HW driver
>>> advertises that the device does not support the full 16-bit CCID
>>> range.
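
For example, the fallback can be as small as flipping the existing quirk
bit; 'full_ccid_range' is an assumed ulp_ddp_limits field name, while
NVME_QUIRK_SKIP_CID_GEN is the existing core quirk referred to above:

static void nvme_tcp_ddp_apply_ccid_quirk(struct nvme_ctrl *ctrl,
					  const struct ulp_ddp_limits *limits)
{
	/* without the full 16-bit CCID range, skip the generation counter */
	if (!limits->full_ccid_range)
		ctrl->quirks |= NVME_QUIRK_SKIP_CID_GEN;
}
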
>>>
>>> Furthermore, we let the offloading driver advertise the maximum HW
>>> sectors/segments via ulp_ddp_limits.
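
A rough sketch of how the ULP would consume that (the
'max_ddp_sectors'/'max_ddp_sgl_len' field names are assumptions about
the ulp_ddp_limits layout):

static void nvme_tcp_ddp_apply_limits(struct nvme_ctrl *ctrl,
				      const struct ulp_ddp_limits *limits)
{
	/* clamp the controller limits to what the offload device can do */
	ctrl->max_hw_sectors = min_not_zero(ctrl->max_hw_sectors,
					    (u32)limits->max_ddp_sectors);
	ctrl->max_segments = min_not_zero(ctrl->max_segments,
					  (u32)limits->max_ddp_sgl_len);
}
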
>>>
>>> A follow-up patch introduces the data-path changes required for this
>>> offload.
>>>
>>> Socket operations need a netdev reference. This reference is
>>> dropped on NETDEV_GOING_DOWN events to allow the device to go down in
>>> a follow-up patch.
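
The reference handling would typically hang off a netdevice notifier. A
rough sketch (the ddp_netdev field is an assumed name; the controller
list, its mutex and the error-recovery helper already exist in tcp.c):

static int nvme_tcp_netdev_event(struct notifier_block *this,
				 unsigned long event, void *ptr)
{
	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
	struct nvme_tcp_ctrl *ctrl;

	switch (event) {
	case NETDEV_GOING_DOWN:
		mutex_lock(&nvme_tcp_ctrl_mutex);
		list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
			if (ndev != ctrl->ddp_netdev)	/* assumed field */
				continue;
			/* tear down the queues, which unoffloads the
			 * sockets and drops the netdev reference, so
			 * the device is free to go down */
			nvme_tcp_error_recovery(&ctrl->ctrl);
		}
		mutex_unlock(&nvme_tcp_ctrl_mutex);
		break;
	}
	return NOTIFY_DONE;
}

This would be registered once at module load with
register_netdevice_notifier().
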
>>>
>>> Signed-off-by: Boris Pismenny <borisp at nvidia.com>
>>> Signed-off-by: Ben Ben-Ishay <benishay at nvidia.com>
>>> Signed-off-by: Or Gerlitz <ogerlitz at nvidia.com>
>>> Signed-off-by: Yoray Zack <yorayz at nvidia.com>
>>> Signed-off-by: Shai Malin <smalin at nvidia.com>
>>> Signed-off-by: Aurelien Aptel <aaptel at nvidia.com>
>>> Reviewed-by: Chaitanya Kulkarni <kch at nvidia.com>
>>> ---
>>
>> For NVMe related code :-
>>
>> The offload feature is configurable and may not be turned on in the
>> absence of the H/W. In order to keep the nvme/host/tcp.c file small
>> and only handling core functionality, I wonder if we should move the
>> tcp-offload code into its own file, say nvme/host/tcp-offload.c ?
> 
> Maybe. It wouldn't be tcp_offload.c but rather tcp_ddp.c, because it's
> not offloading the TCP stack but rather doing direct data placement.
>

fine ...

> If we are going to do that, it will either pollute nvme.h or require a
> common header file, which is something I'd like to avoid if possible.

My comment was mainly based on how we have separated core code from
configurable features elsewhere, and I was wondering whether any
decision we make for the host will also apply to the target DDP code in
the future. But if you prefer to keep it as it is, let's not bloat the
header ...
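
Just to make the trade-off concrete: a split along those lines would
need declarations for helpers like the nvme_tcp_offload_socket() sketch
above in a shared header, with inline stubs for the compiled-out case,
which is exactly the bloat you'd rather avoid. Something like
(illustrative, assuming the series' CONFIG_ULP_DDP symbol):

#ifdef CONFIG_ULP_DDP
/* implemented in the hypothetical nvme/host/tcp-offload.c */
int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue);
void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue);
#else
static inline int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
{
	return 0;
}
static inline void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
{
}
#endif
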

-ck




