[PATCH v28 01/20] net: Introduce direct data placement tcp offload
Eric Dumazet
edumazet at google.com
Wed May 14 00:12:28 PDT 2025
On Wed, Apr 30, 2025 at 1:58 AM Aurelien Aptel <aaptel at nvidia.com> wrote:
>
> From: Boris Pismenny <borisp at nvidia.com>
>
> This commit introduces direct data placement (DDP) offload for TCP.
>
> The motivation is to save the compute resources/cycles spent
> copying data from SKBs to block layer buffers, and on CRC
> calculation/verification for received PDUs (Protocol Data Units).
>
> The DDP capability is accompanied by new net_device operations that
> configure hardware contexts.
>
> There is a context per socket, and a context per DDP operation.
> Additionally, a resynchronization routine is used to assist the
> hardware in handling TCP out-of-order (OOO) segments and resuming
> the offload. Furthermore, we let the offloading driver advertise
> its maximum hw sectors/segments.
>
> The interface includes the following net-device ddp operations:
>
> 1. sk_add - add offload for the queue represented by socket+config pair
> 2. sk_del - remove the offload for the socket/queue
> 3. ddp_setup - request copy offload for buffers associated with an IO
> 4. ddp_teardown - release offload resources for that IO
> 5. limits - query NIC driver for quirks and limitations (e.g.
> max number of scatter gather entries per IO)
> 6. set_caps - request ULP DDP capabilities enablement
> 7. get_caps - request current ULP DDP capabilities
> 8. get_stats - query NIC driver for ULP DDP stats
>
> Using this interface, the NIC hardware will scatter TCP payload
> directly to the BIO pages according to the command_id.
>
> To maintain the correctness of the network stack, the driver is
> expected to construct SKBs that point to the BIO pages.
>
> The SKB passed to the network stack from the driver represents
> data as it is on the wire, while it is pointing directly to data
> in destination buffers.
>
> As a result, data from page frags must not be copied out to
> the linear part. To avoid needless copies, such as those done by
> skb_condense(), we set the skb->no_condense bit. In addition,
> skb->ulp_crc will be used by the upper layers to determine whether
> CRC recalculation is required. Two separate skb indications are
> needed to avoid false-positive GRO flush events.
>
> Follow-up patches will use this interface for DDP in NVMe-TCP.
>
> Capability bits stored in net_device allow drivers to report which
> ULP DDP capabilities a device supports. Control over these
> capabilities will be exposed to userspace in later patches.
>
>
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index beb084ee4f4d..38b800f2593d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -851,6 +851,8 @@ enum skb_tstamp_type {
> * @slow_gro: state present at GRO time, slower prepare step required
> * @tstamp_type: When set, skb->tstamp has the
> * delivery_time clock base of skb->tstamp.
> + * @no_condense: When set, don't condense fragments (DDP offloaded)
> + * @ulp_crc: CRC offloaded
> * @napi_id: id of the NAPI struct this skb came from
> * @sender_cpu: (aka @napi_id) source CPU in XPS
> * @alloc_cpu: CPU which did the skb allocation.
> @@ -1028,6 +1030,10 @@ struct sk_buff {
> __u8 csum_not_inet:1;
> #endif
> __u8 unreadable:1;
> +#ifdef CONFIG_ULP_DDP
> + __u8 no_condense:1;
> + __u8 ulp_crc:1;
> +#endif
Sorry for the late review.

I do not think you need a precious bit for the no_condense feature.
After all, the driver knows it deals with DDP, and can make sure to
place the headers right before skb_shinfo() so that skb_tailroom() is
zero.

If you really must prevent any pull from a page frag into skb->head
(preventing an skb->head realloc), skb->unreadable should work just
fine?
More information about the Linux-nvme mailing list