[PATCH v12 07/26] nvme-tcp: Add DDP offload control path

Mon Aug 14 11:54:55 PDT 2023

>>> +static inline bool is_netdev_ulp_offload_active(struct net_device *netdev,
>>> +                                             struct nvme_tcp_queue *queue)
>>> +{
>>> +     if (!netdev || !queue)
>>> +             return false;
>>
>> Is it reasonable to be called here with !netdev or !queue ?
> 
> The check is needed only for the IO queue case but we can move it
> earlier in nvme_tcp_start_queue().

I still don't understand even on io queues how this can happen.

>>> +
>>> +     /* If we cannot query the netdev limitations, do not offload */
>>> +     if (!nvme_tcp_ddp_query_limits(netdev, queue))
>>> +             return false;
>>> +
>>> +     /* If netdev supports nvme-tcp ddp offload, we can offload */
>>> +     if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>>> +             return true;
>>
>> This should be coming from the API itself, have the limits query
>> api fail if this is off.
> 
> We can move the function to the ULP DDP layer.
> 
>> btw, what is the active thing? is this driven from ethtool enable?
>> what happens if the user disables it while there is a ulp using it?
> 
> The active bits are indeed driven by ethtool according to the design
> Jakub suggested.
> The nvme-tcp connection will have to be reconnected to see the effect of
> changing the bit.

It should move inside the api as well. Don't want to care about it in
nvme.
>>> +
>>> +     return false;
>>
>> This can be folded to the above function.
> 
> We won't be able to check for TLS in a common wrapper. We think this
> should be kept.

Why? any tcp ddp need to be able to support tls. Nothing specific to
nvme here.

>>> +static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
>>> +{
>>> +     struct net_device *netdev = queue->ctrl->offloading_netdev;
>>> +     struct ulp_ddp_config config = {.type = ULP_DDP_NVME};
>>> +     int ret;
>>> +
>>> +     config.nvmeotcp.pfv = NVME_TCP_PFV_1_0;
>>
>> Question, what happens if the pfv changes, is the ddp guaranteed to
>> work?
> 
> The existing HW supports only NVME_TCP_PFV_1_0.
> Once a new version will be used, the device driver should fail the
> sk_add().

OK.

>>> +/* In presence of packet drops or network packet reordering, the device may lose
>>> + * synchronization between the TCP stream and the L5P framing, and require a
>>> + * resync with the kernel's TCP stack.
>>> + *
>>> + * - NIC HW identifies a PDU header at some TCP sequence number,
>>> + *   and asks NVMe-TCP to confirm it.
>>> + * - When NVMe-TCP observes the requested TCP sequence, it will compare
>>> + *   it with the PDU header TCP sequence, and report the result to the
>>> + *   NIC driver
>>> + */
>>> +static void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
>>> +                                  struct sk_buff *skb, unsigned int offset)
>>> +{
>>> +     u64 pdu_seq = TCP_SKB_CB(skb)->seq + offset - queue->pdu_offset;
>>> +     struct net_device *netdev = queue->ctrl->offloading_netdev;
>>> +     u64 pdu_val = (pdu_seq << 32) | ULP_DDP_RESYNC_PENDING;
>>> +     u64 resync_val;
>>> +     u32 resync_seq;
>>> +
>>> +     resync_val = atomic64_read(&queue->resync_req);
>>> +     /* Lower 32 bit flags. Check validity of the request */
>>> +     if ((resync_val & ULP_DDP_RESYNC_PENDING) == 0)
>>> +             return;
>>> +
>>> +     /*
>>> +      * Obtain and check requested sequence number: is this PDU header
>>> +      * before the request?
>>> +      */
>>> +     resync_seq = resync_val >> 32;
>>> +     if (before(pdu_seq, resync_seq))
>>> +             return;
>>> +
>>> +     /*
>>> +      * The atomic operation guarantees that we don't miss any NIC driver
>>> +      * resync requests submitted after the above checks.
>>> +      */
>>> +     if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>>> +                          pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>>> +                          atomic64_read(&queue->resync_req))
>>> +             netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>>> +                                                     queue->sock->sk,
>>> +                                                     pdu_seq);
>>
>> Who else is doing an atomic on this value? and what happens
>> if the cmpxchg fails?
> 
> The driver thread can set queue->resync_req concurrently in patch
> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
> nvmeotcp_update_resync().
> 
> If the cmpxchg fails it means a new resync request was triggered by the
> HW, the old request will be dropped and the new one will be processed by
> a later PDU.

So resync_req is actually the current tcp sequence number or something?
The name resync_req is very confusing.

> 
>>> +}
>>> +
>>> +static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
>>> +{
>>> +     struct nvme_tcp_queue *queue = sk->sk_user_data;
>>> +
>>> +     /*
>>> +      * "seq" (TCP seq number) is what the HW assumes is the
>>> +      * beginning of a PDU.  The nvme-tcp layer needs to store the
>>> +      * number along with the "flags" (ULP_DDP_RESYNC_PENDING) to
>>> +      * indicate that a request is pending.
>>> +      */
>>> +     atomic64_set(&queue->resync_req, (((uint64_t)seq << 32) | flags));
>>
>> Question, is this coming from multiple contexts? what contexts are
>> competing here that make it an atomic operation? It is unclear what is
>> going on here tbh.
> 
> The driver could get a resync request and set queue->resync_req
> concurrently while processing HW CQEs as you can see in patch
> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
> nvmeotcp_update_resync().
> 
> The resync flow is:
> 
>       nvme-tcp                           mlx5                     hw
>          |                                |                        |
>          |                                |                      sends CQE with
>          |                                |                      resync request
>          |                                | <----------------------'
>          |                         nvmeotcp_update_resync()
>    nvme_tcp_resync_request() <-----------'|
>    we store the request
> 
> Later, while receiving PDUs we check for pending requests.
> If there is one, we send call nvme_tcp_resync_response() which calls
> into mlx5 to send the response to the HW.
> 

...

>>> +                     ret = nvme_tcp_offload_socket(queue);
>>> +                     if (ret) {
>>> +                             dev_info(nctrl->device,
>>> +                                      "failed to setup offload on queue %d ret=%d\n",
>>> +                                      idx, ret);
>>> +                     }
>>> +             }
>>> +     } else {
>>>                ret = nvmf_connect_admin_queue(nctrl);
>>> +             if (ret)
>>> +                     goto err;
>>>
>>> -     if (!ret) {
>>> -             set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>>> -     } else {
>>> -             if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>> -                     __nvme_tcp_stop_queue(queue);
>>> -             dev_err(nctrl->device,
>>> -                     "failed to connect queue: %d ret=%d\n", idx, ret);
>>> +             netdev = get_netdev_for_sock(queue->sock->sk);
>>
>> Is there any chance that this is a different netdev than what is
>> already recorded? doesn't make sense to me.
> 
> The idea is that we are first starting the admin queue, which looks up
> the netdev associated with the socket and stored in the queue. Later,
> when the IO queues are started, we use the recorded netdev.
> 
> In cases of bonding or vlan, a netdev can have lower device links, which
> get_netdev_for_sock() will look up.

I think the code should in high level do:
	if (idx) {
		ret = nvmf_connect_io_queue(nctrl, idx);
		if (ret)
			goto err;
		if (nvme_tcp_ddp_query_limits(queue))
			nvme_tcp_offload_socket(queue);

	} else {
		ret = nvmf_connect_admin_queue(nctrl);
		if (ret)
			goto err;
		ctrl->ddp_netdev = get_netdev_for_sock(queue->sock->sk);
		if (nvme_tcp_ddp_query_limits(queue))
			nvme_tcp_offload_apply_limits(queue);
	}

ctrl->ddp_netdev should be cleared and put when the admin queue
is stopped/freed, similar to how async_req is handled.

>>> +                     goto done;
>>> +             }
>>> +             if (is_netdev_ulp_offload_active(netdev, queue))
>>> +                     nvme_tcp_offload_apply_limits(queue, netdev);
>>> +             /*
>>> +              * release the device as no offload context is
>>> +              * established yet.
>>> +              */
>>> +             dev_put(netdev);
>>
>> the put is unclear, what does it pair with? the get_netdev_for_sock?
> 
> Yes, get_netdev_for_sock() takes a reference, which we don't need at
> that point so we put it.

Well, you store a pointer to it, what happens if it goes away while
the controller is being set up?