[PATCH 3/3] net: hisilicon: new hip04 ethernet driver

Arnd Bergmann arnd at arndb.de
Wed Apr 2 02:21:45 PDT 2014


On Tuesday 01 April 2014 21:27:12 Zhangfei Gao wrote:
> +static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)

While it looks like there are no serious functional bugs left, this
function is rather inefficient, as has been pointed out before:

> +{
> +       struct hip04_priv *priv = netdev_priv(ndev);
> +       struct net_device_stats *stats = &ndev->stats;
> +       unsigned int tx_head = priv->tx_head;
> +       struct tx_desc *desc = &priv->tx_desc[tx_head];
> +       dma_addr_t phys;
> +
> +       hip04_tx_reclaim(ndev, false);
> +       mod_timer(&priv->txtimer, jiffies + RECLAIM_PERIOD);
> +
> +       if (priv->tx_count >= TX_DESC_NUM) {
> +               netif_stop_queue(ndev);
> +               return NETDEV_TX_BUSY;
> +       }

This is where you have two problems:

- If the descriptor ring is full, you wait for RECLAIM_PERIOD,
  which is far too long at 500ms, because during that time you
  cannot add any further data to the stopped queue.

- As David Laight pointed out earlier, you must also ensure that
  you don't have too much /data/ pending in the descriptor ring
  when you stop the queue. For a 10Mbit connection, you have already
  tested (as we discussed on IRC) that 64 descriptors with 1500-byte
  frames give you a 68ms round-trip ping time, which is too much.
  Conversely, on 1Gbit, having only 64 descriptors actually seems
  a little low, and you may be able to get better throughput if
  you extend the ring to e.g. 512 descriptors.
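
To put rough numbers on the second point, here is a small userspace
sketch (not driver code) that estimates how long a completely full ring
takes to drain at a given link speed. The helper name
ring_drain_time_us() is made up for illustration, and the estimate
counts payload bits only, ignoring preamble, inter-frame gap and CRC,
so real figures will be somewhat higher.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimated time, in microseconds, for the hardware to transmit a full
 * TX ring of equally sized frames. Hypothetical helper: it counts only
 * the frame bytes and ignores all per-frame wire overhead.
 */
static uint64_t ring_drain_time_us(uint32_t num_desc, uint32_t frame_bytes,
                                   uint64_t link_bps)
{
        uint64_t bits;

        assert(link_bps > 0);
        bits = (uint64_t)num_desc * frame_bytes * 8;
        return bits * 1000000ULL / link_bps;
}
```

With these assumptions, 64 descriptors of 1500 bytes come to 76800us
at 10Mbit, which is in the same range as the measured 68ms ping time,
but only 768us at 1Gbit. A 512-entry ring at 1Gbit comes to 6144us.
This is why a fixed descriptor count cannot suit both link speeds and
why the amount of queued data matters more than the number of
descriptors.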

> +       phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
> +       if (dma_mapping_error(&ndev->dev, phys)) {
> +               dev_kfree_skb(skb);
> +               return NETDEV_TX_OK;
> +       }
> +
> +       priv->tx_skb[tx_head] = skb;
> +       priv->tx_phys[tx_head] = phys;
> +       desc->send_addr = cpu_to_be32(phys);
> +       desc->send_size = cpu_to_be16(skb->len);
> +       desc->cfg = cpu_to_be32(DESC_DEF_CFG);
> +       phys = priv->tx_desc_dma + tx_head * sizeof(struct tx_desc);
> +       desc->wb_addr = cpu_to_be32(phys);

One detail: since you don't have cache-coherent DMA, "desc" will
reside in uncached memory, so you should try to minimize the number of
accesses. It's probably faster if you build the descriptor on the stack
and then atomically copy it over, rather than assigning one member at
a time.

The same would be true for the rx descriptors.
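
A minimal userspace sketch of that pattern might look like the
following. The struct layout (including the "reserved" padding field)
and the helper name fill_tx_desc() are assumptions for illustration;
only the four field names are taken from the quoted patch. htonl() and
htons() stand in for cpu_to_be32() and cpu_to_be16(), and the plain
struct assignment is just a single copy, with no atomicity guarantee
implied.

```c
#include <arpa/inet.h>  /* htonl(), htons(): userspace stand-ins */
#include <stdint.h>

/* Hypothetical layout; the real struct tx_desc may differ. */
struct tx_desc {
        uint32_t send_addr;
        uint16_t send_size;
        uint16_t reserved;
        uint32_t cfg;
        uint32_t wb_addr;
};

static void fill_tx_desc(struct tx_desc *slot, uint32_t buf_phys,
                         uint16_t len, uint32_t cfg, uint32_t desc_phys)
{
        /* Build the descriptor in ordinary cached (stack) memory first. */
        struct tx_desc tmp = {
                .send_addr = htonl(buf_phys),
                .send_size = htons(len),
                .cfg       = htonl(cfg),
                .wb_addr   = htonl(desc_phys),
        };

        /* One struct copy into the ring slot instead of per-field stores. */
        *slot = tmp;
}
```

In the driver, "slot" would point into the uncached descriptor ring, so
the compiler is free to turn the copy into a few wide stores rather
than one narrow store per member.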

> +       skb_tx_timestamp(skb);
> +
> +       /* Don't wait up for transmitted skbs to be freed. */
> +       skb_orphan(skb);
> +
> +       hip04_set_xmit_desc(priv, phys);
> +       priv->tx_head = TX_NEXT(tx_head);
> +
> +       stats->tx_bytes += skb->len;
> +       stats->tx_packets++;
> +       priv->tx_count++;
> +
> +       return NETDEV_TX_OK;
> +}

	Arnd


