[PATCH v28 01/20] net: Introduce direct data placement tcp offload
Eric Dumazet
edumazet at google.com
Wed May 14 00:12:28 PDT 2025
On Wed, Apr 30, 2025 at 1:58 AM Aurelien Aptel <aaptel at nvidia.com> wrote:
>
> From: Boris Pismenny <borisp at nvidia.com>
>
> This commit introduces direct data placement (DDP) offload for TCP.
>
> The motivation is to save the compute resources/cycles spent
> copying data from SKBs to block layer buffers, and on CRC
> calculation/verification for received PDUs (Protocol Data Units).
>
> The DDP capability is accompanied by new net_device operations that
> configure hardware contexts.
>
> There is a context per socket, and a context per DDP operation.
> Additionally, a resynchronization routine is used to assist the
> hardware in handling TCP out-of-order (OOO) segments and resuming
> the offload. Furthermore, we let the offloading driver advertise
> its maximum hw sectors/segments.
>
> The interface includes the following net-device ddp operations:
>
> 1. sk_add - add offload for the queue represented by socket+config pair
> 2. sk_del - remove the offload for the socket/queue
> 3. ddp_setup - request copy offload for buffers associated with an IO
> 4. ddp_teardown - release offload resources for that IO
> 5. limits - query NIC driver for quirks and limitations (e.g.
> max number of scatter gather entries per IO)
> 6. set_caps - request ULP DDP capabilities enablement
> 7. get_caps - request current ULP DDP capabilities
> 8. get_stats - query NIC driver for ULP DDP stats
>
> Using this interface, the NIC hardware will scatter TCP payload
> directly to the BIO pages according to the command_id.
>
> To maintain the correctness of the network stack, the driver is
> expected to construct SKBs that point to the BIO pages.
>
> The SKB passed to the network stack from the driver represents
> data as it is on the wire, while it is pointing directly to data
> in destination buffers.
>
> As a result, data from page frags must not be copied out to
> the linear part. To avoid needless copies, such as those done by
> skb_condense(), we set the skb->no_condense bit. In addition,
> skb->ulp_crc will be used by the upper layers to determine whether
> CRC recalculation is required. Two separate skb indications are
> needed to avoid false-positive GRO flush events.
>
> Follow-up patches will use this interface for DDP in NVMe-TCP.
>
> Capability bits stored in net_device allow drivers to report which
> ULP DDP capabilities a device supports. Control over these
> capabilities will be exposed to userspace in later patches.
>
>
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index beb084ee4f4d..38b800f2593d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -851,6 +851,8 @@ enum skb_tstamp_type {
> * @slow_gro: state present at GRO time, slower prepare step required
> * @tstamp_type: When set, skb->tstamp has the
> * delivery_time clock base of skb->tstamp.
> + * @no_condense: When set, don't condense fragments (DDP offloaded)
> + * @ulp_crc: CRC offloaded
> * @napi_id: id of the NAPI struct this skb came from
> * @sender_cpu: (aka @napi_id) source CPU in XPS
> * @alloc_cpu: CPU which did the skb allocation.
> @@ -1028,6 +1030,10 @@ struct sk_buff {
> __u8 csum_not_inet:1;
> #endif
> __u8 unreadable:1;
> +#ifdef CONFIG_ULP_DDP
> + __u8 no_condense:1;
> + __u8 ulp_crc:1;
> +#endif
Sorry for the late review.

I do not think you need a precious bit for the no_condense feature.
After all, the driver knows it deals with DDP, and can make sure to
place the headers right before skb_shinfo() so that skb_tailroom() is
zero.

If you really must prevent any pull from a page frag into skb->head
(preventing an skb->head realloc), skb->unreadable should work just
fine?
More information about the Linux-nvme mailing list