Optimizing kernel compilation / alignments for network performance
Andrew Lunn
andrew at lunn.ch
Fri May 6 05:42:49 PDT 2022
> > I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> > This seems rather excessive, especially since most people are going to use a MTU of 1500.
> > My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
> > This should significantly reduce the time spent on flushing caches.
>
> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
> configure MTU and add support for frames beyond 8192 byte size"):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
>
> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
>
> I do all my testing with
> #define BGMAC_RX_MAX_FRAME_SIZE 1536
That helps show that cache operations are part of your bottleneck.
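To put rough numbers on the buffer sizes discussed above, here is a
minimal sketch, not driver code. The helper names rx_frame_size() and
round_up_pow2() are hypothetical, the constants are the usual Ethernet
values, and the 64-byte rounding is an assumption about cache line
size. It ignores whatever extra header space the driver reserves in
front of the frame.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HLEN	14	/* Ethernet header */
#define VLAN_HLEN	4	/* one 802.1Q tag */
#define ETH_FCS_LEN	4	/* trailing CRC */

/* Hypothetical helper: round n up to a power-of-two alignment. */
static size_t round_up_pow2(size_t n, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: largest frame on the wire for a given MTU. */
static size_t rx_frame_size(size_t mtu)
{
	return mtu + ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN;
}
```

For an MTU of 1500 this gives 1522 bytes, or 1536 after rounding up to
64, which is the value used in the testing quoted above. That is
roughly a sixth of 9724.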
Taking a quick look at the driver. On the receive side:
	/* Unmap buffer to make it accessible to the CPU */
	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);
Here the data is mapped, ready for the CPU to use it.
	/* Get info from the header */
	len = le16_to_cpu(rx->len);
	flags = le16_to_cpu(rx->flags);

	/* Check for poison and drop or pass the packet */
	if (len == 0xdead && flags == 0xbeef) {
		netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	if (len > BGMAC_RX_ALLOC_SIZE) {
		netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_length_errors++;
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	/* Omit CRC. */
	len -= ETH_FCS_LEN;

	skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
	if (unlikely(!skb)) {
		netdev_err(bgmac->net_dev, "build_skb failed\n");
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}
	skb_put(skb, BGMAC_RX_FRAME_OFFSET +
		BGMAC_RX_BUF_OFFSET + len);
	skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
		 BGMAC_RX_BUF_OFFSET);

	skb_checksum_none_assert(skb);
	skb->protocol = eth_type_trans(skb, bgmac->net_dev);
and this is the first access of the actual data. You can make the
cache work for you, rather than against you, by adding a call to

	prefetch(buf);

just after the dma_unmap_single(). That will start getting the frame
header from DRAM into cache, so hopefully it is available by the time
eth_type_trans() is called and you don't have a cache miss.
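
The ordering can be sketched in userspace as below. This is only an
illustration, not a patch. prefetch_hint() and rx_ethertype() are
hypothetical names, and __builtin_prefetch() (a GCC/Clang builtin)
stands in for the kernel's prefetch(). The point is that the hint is
issued first, other work happens while the line is fetched, and only
then is the frame header read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's prefetch(); GCC/Clang only. */
static inline void prefetch_hint(const void *p)
{
	__builtin_prefetch(p);
}

/*
 * Hypothetical receive step: issue the hint as early as possible, then
 * read the EtherType from the frame that starts frame_off bytes into buf.
 */
static uint16_t rx_ethertype(const uint8_t *buf, size_t frame_off)
{
	prefetch_hint(buf);

	/* ... length/flag checks and skb setup would happen here ... */

	/* First real access to the frame: bytes 12 and 13 hold the EtherType. */
	return (uint16_t)((buf[frame_off + 12] << 8) | buf[frame_off + 13]);
}
```

The hint does not change the result, only when the cache line is
requested, so whether it helps is something to measure.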
Andrew