[PATCH net-next v8 3/3] net: axienet: Introduce dmaengine support

Fri Oct 20 12:27:50 PDT 2023

<snip>
> Minor nit: please use the reverse x-mas tree order

Sure, will fix it in next version.

<snip>
> You can avoid some duplicate code with:
> 		goto drop_skb;
> 
> and adding at the bottom of this function:
> 
> drop_skb:
> 	dev_kfree_skb_any(skb);
> 	return NETDEV_TX_OK;
> 

Agree , will switch to it in next version.

<snip>

> You forgot to add the netif_txq_maybe_stop() call, as suggested by Jakub in
> the previous revision.

I was in an impression that these are multi queue specific APIs, so I skipped them.
But revisited the implementation and it seems clear now, and modified the driver 
to use these lockless queue stopping / waking helpers.

+       tax = skb_get_tx_queue(lp->ndev, skb);
+       netdev_tx_sent_queue(txq, skb->len);
+       netif_txq_maybe_stop(txq, CIRC_SPACE(lp->tx_ring_head, lp->tx_ring_tail,
+                            TX_BD_NUM_MAX), MAX_SKB_FRAGS + 1, 2 * MAX_SKB_FRAGS);

However, in netperf benchmark (TCP TX) I am seeing a dip in performance
(~35-40Mbps) when switching to these stop/wake helpers. Is it expected
considering extra logic in maintaining dynamic queue and these helpers?

Also, in throughput benchmarking there was no occurrence when the 
queue was stopped/woken up.

Throughput: (10^6bits/sec)
915.55 (v8 - without using lockless queue stop/wake helpers)

======
Switch to lockless queue stop/wake helpers

# netperf -H 192.168.10.20 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

131072  16384  16384    10.02     876.94

> 
> > +
> > +	skbuf_dma->skb = skb;
> > +	skbuf_dma->sg_len = sg_len;
> > +	dma_tx_desc->callback_param = lp;
> > +	dma_tx_desc->callback_result = axienet_dma_tx_cb;
> > +	dmaengine_submit(dma_tx_desc);
> > +	dma_async_issue_pending(lp->tx_chan);
> > +
> > +	return NETDEV_TX_OK;
> > +
> > +xmit_error_unmap_sg:
> > +	dma_unmap_sg(lp->dev, skbuf_dma->sgl, sg_len, DMA_TO_DEVICE);
> 
> If you need to drop the skb (as I suspect), you can reuse the drop_skb label
> here:
> 
> drop_skb:
> 	dev_kfree_skb_any(skb);
> > +	return NETDEV_TX_OK;

Yes , will make this change and this will also fix skb leak.

> > +}
> > +
> >  /**
> >   * axienet_tx_poll - Invoked once a transmit is completed by the
> >   * Axi DMA Tx channel.
> > @@ -893,6 +1028,42 @@ axienet_start_xmit(struct sk_buff *skb, struct
> net_device *ndev)
> >  	return NETDEV_TX_OK;
> >  }
> >
> > +/**
> > + * axienet_dma_rx_cb - DMA engine callback for RX channel.
> > + * @data:       Pointer to the skbuf_dma_descriptor structure.
> > + * @result:     error reporting through dmaengine_result.
> > + * This function is called by dmaengine driver for RX channel to
> > +notify
> > + * that the packet is received.
> > + */
> > +static void axienet_dma_rx_cb(void *data, const struct
> > +dmaengine_result *result) {
> > +	struct axienet_local *lp = data;
> > +	struct skbuf_dma_descriptor *skbuf_dma;
> > +	size_t meta_len, meta_max_len, rx_len;
> > +	struct sk_buff *skb;
> > +	u32 *app_metadata;
> 
> Minor nit: please respect the reverse x-mas tree order

Yes, will fix it in next version.

> 
> > +
> > +	skbuf_dma = axienet_get_rx_desc(lp, lp->rx_ring_tail++);
> > +	skb = skbuf_dma->skb;
> > +	app_metadata = dmaengine_desc_get_metadata_ptr(skbuf_dma-
> >desc, &meta_len,
> > +						       &meta_max_len);
> > +	dma_unmap_single(lp->dev, skbuf_dma->dma_address, lp-
> >max_frm_size,
> > +			 DMA_FROM_DEVICE);
> > +	/* TODO: Derive app word index programmatically */
> > +	rx_len = (app_metadata[LEN_APP] & 0xFFFF);
> > +	skb_put(skb, rx_len);
> > +	skb->protocol = eth_type_trans(skb, lp->ndev);
> > +	skb->ip_summed = CHECKSUM_NONE;
> > +
> > +	__netif_rx(skb);
> 
> It's a pity you can't leverage NAPI here.
> 
> I think that could be doable as a follow-up, but I'm unsure if that would fit
> the DMA engine model: in this callback you could cache the ready dma index
> (a single range should suffice) and schedule the napi instance. The actual
> dma processing will be done in napi poll.
> 
> Another possible follow-up could be introducing a "bulk" RX callback in the
> DMA engine, to mitigate the indirect call overhead on a burst of RX DMA
> completion - assuming the DMA engine actually generates such burst.

Agree , these are possible thoughts and will start working on it once 
this baseline dmaengine support series is done.  
> 
> > +	u64_stats_update_begin(&lp->rx_stat_sync);
> > +	u64_stats_add(&lp->rx_packets, 1);
> > +	u64_stats_add(&lp->rx_bytes, rx_len);
> > +	u64_stats_update_end(&lp->rx_stat_sync);
> > +	axienet_rx_submit_desc(lp->ndev);
> > +	dma_async_issue_pending(lp->rx_chan);
> > +}
> > +
> >  /**
> >   * axienet_rx_poll - Triggered by RX ISR to complete the BD processing.
> >   * @napi:	Pointer to NAPI structure.
> > @@ -1126,6 +1297,150 @@ static irqreturn_t axienet_eth_irq(int irq,
> > void *_ndev)
> >
> >  static void axienet_dma_err_handler(struct work_struct *work);
> >
> > +/**
> > + * axienet_rx_submit_desc - Submit the rx descriptors to dmaengine.
> > + * allocate skbuff, map the scatterlist and obtain a descriptor
> > + * and then add the callback information and submit descriptor.
> > + *
> > + * @ndev:	net_device pointer
> > + *
> > + *Return: 0, on success.
> > + *          non-zero error value on failure
> > + */
> > +static int axienet_rx_submit_desc(struct net_device *ndev) {
> > +	struct dma_async_tx_descriptor *dma_rx_desc = NULL;
> > +	struct axienet_local *lp = netdev_priv(ndev);
> > +	struct skbuf_dma_descriptor *skbuf_dma;
> > +	struct sk_buff *skb;
> > +	dma_addr_t addr;
> > +	int ret;
> > +
> > +	skbuf_dma = axienet_get_rx_desc(lp, lp->rx_ring_head);
> > +	if (!skbuf_dma)
> > +		return -ENOMEM;
> 
> Minor nit: here a newline would make the core more readable

Accepted , will add in next version.

> 
> > +	lp->rx_ring_head++;
> > +	skb = netdev_alloc_skb(ndev, lp->max_frm_size);
> > +	if (!skb)
> > +		return -ENOMEM;
> 
> Another possible follow-up: usually the skb header is allocated just before
> sending it to the network stack (e.g. just before the
> __netif_rx() call) to be cache friendly. Here you could allocate just the data
> part and later use e.g. build_skb_around()

Sure, will explore on it.
> 
> > +
> > +	sg_init_table(skbuf_dma->sgl, 1);
> > +	addr = dma_map_single(lp->dev, skb->data, lp->max_frm_size,
> DMA_FROM_DEVICE);
> > +	if (unlikely(dma_mapping_error(lp->dev, addr))) {
> > +		if (net_ratelimit())
> > +			netdev_err(ndev, "DMA mapping error\n");
> > +		ret = -ENOMEM;
> > +		goto rx_submit_err_free_skb;
> > +	}
> > +	sg_dma_address(skbuf_dma->sgl) = addr;
> > +	sg_dma_len(skbuf_dma->sgl) = lp->max_frm_size;
> > +	dma_rx_desc = dmaengine_prep_slave_sg(lp->rx_chan, skbuf_dma-
> >sgl,
> > +					      1, DMA_DEV_TO_MEM,
> > +					      DMA_PREP_INTERRUPT);
> > +	if (!dma_rx_desc) {
> > +		ret = -EINVAL;
> > +		goto rx_submit_err_unmap_skb;
> > +	}
> > +
> > +	skbuf_dma->skb = skb;
> > +	skbuf_dma->dma_address = sg_dma_address(skbuf_dma->sgl);
> > +	skbuf_dma->desc = dma_rx_desc;
> > +	dma_rx_desc->callback_param = lp;
> > +	dma_rx_desc->callback_result = axienet_dma_rx_cb;
> > +	dmaengine_submit(dma_rx_desc);
> > +
> > +	return 0;
> > +
> > +rx_submit_err_unmap_skb:
> > +	dma_unmap_single(lp->dev, addr, lp->max_frm_size,
> DMA_FROM_DEVICE);
> > +rx_submit_err_free_skb:
> > +	dev_kfree_skb(skb);
> > +	return ret;
> 
> It looks like the error code is ignored by the caller. Possibly you can change
> this to a 'void' function.

will make it void in next version.

Thanks,
Radhey