[PATCH net-next 00/10] net: faster and simpler CRC32C computation
Eric Biggers
ebiggers at kernel.org
Thu May 15 12:50:51 PDT 2025
On Thu, May 15, 2025 at 08:21:36PM +0100, David Laight wrote:
> On Sun, 11 May 2025 16:07:50 -0700
> Eric Biggers <ebiggers at kernel.org> wrote:
>
> > On Sun, May 11, 2025 at 11:45:14PM +0200, Ard Biesheuvel wrote:
> > > On Sun, 11 May 2025 at 23:22, Andrew Lunn <andrew at lunn.ch> wrote:
> > > >
> > > > On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:
> > > > > On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> > > > > > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > > > > > > Update networking code that computes the CRC32C of packets to just call
> > > > > > > crc32c() without unnecessary abstraction layers. The result is faster
> > > > > > > and simpler code.
> > > > > >
> > > > > > Hi Eric
> > > > > >
> > > > > > Do you have some benchmarks for these changes?
> > > > > >
> > > > > > Andrew
> > > > >
> > > > > Do you want benchmarks that show that removing the indirect calls makes things
> > > > > faster? I think that should be fairly self-evident by now after dealing with
> > > > > retpoline for years, but I can provide more details if you need them.
> > > >
> > > > I was thinking more like iperf before/after? Show the CPU load has gone
> > > > down without the bandwidth also going down.
> > > >
> > > > Eric Dumazet has a T-Shirt with a commit message on the back which
> > > > increased network performance by X%. At the moment, there is nothing
> > > > T-Shirt quotable here.
> > > >
> > >
> > > I think that removing layers of redundant code to ultimately call the
> > > same core CRC-32 implementation is a rather obvious win, especially
> > > when indirect calls are involved. The diffstat speaks for itself, so
> > > maybe you can print that on a T-shirt.
> >
> > Agreed with Ard. I did try doing some SCTP benchmarks with iperf3 earlier, but
> > they were very noisy and the CRC32C checksumming seemed to be lost in the noise.
> > There probably are some tricks to running reliable networking benchmarks; I'm
> > not a networking developer. Regardless, this series is a clear win for the
> > CRC32C code, both from a simplicity and performance perspective. It also fixes
> > the kconfig dependency issues. That should be good enough, IMO.
> >
> > In case it's helpful, here are some microbenchmarks of __skb_checksum (old) vs
> > skb_crc32c (new):
> >
> > Linear sk_buffs
> >
> >     Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> >     ===============    =====================    =================
> >                  64                       43                   18
> >                1420                      204                  161
> >               16384                     1735                 1642
> >
> > Nonlinear sk_buffs (even split between head and one fragment)
> >
> >     Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> >     ===============    =====================    =================
> >                  64                      579                   22
> >                1420                     1506                  194
> >               16384                     4365                 1682
> >
> > So 1420-byte linear buffers (roughly the most common case) is 21% faster,
>
> 1420 bytes is unlikely to be the most common case - at least for some users.
> SCTP is message oriented so the checksum is over a 'user message'.
> A not uncommon use is carrying mobile network messages (e.g. SMS) over the IP
> network (instead of TDM links).
> In that case the maximum data chunk size (what is being checksummed) is limited
> to not much over 256 bytes - and a lot of data chunks will be smaller.
> The actual difficulty is getting multiple data chunks into a single ethernet
> packet without adding significant delays.
>
> But the changes definitely improve things.
Interesting. Of course, the data I gave shows that the proportional performance
increase is even greater on short packets than long ones. I'll include those
tables when I resend the patchset and add a row for 256 bytes too.
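
For anyone who wants to check the proportional claim against the quoted tables,
here is a rough sketch that recomputes the reduction in cycles per row. The
cycle counts are copied from the tables above. The helper name
percent_fewer_cycles is made up for illustration, and reading "faster" as
"percent fewer cycles" is an assumption, chosen because it reproduces the 21%
figure for 1420-byte linear buffers.

```python
# Cycle counts copied from the microbenchmark tables quoted above:
# (length in bytes, __skb_checksum cycles, skb_crc32c cycles)
LINEAR = [(64, 43, 18), (1420, 204, 161), (16384, 1735, 1642)]
NONLINEAR = [(64, 579, 22), (1420, 1506, 194), (16384, 4365, 1682)]


def percent_fewer_cycles(old, new):
    """Reduction in cycles as a percentage of the old count.

    Hypothetical helper; assumes "N% faster" means N% fewer cycles.
    """
    return 100.0 * (old - new) / old


def report(name, rows):
    """Print one line per buffer length for the given table."""
    print(name)
    for length, old, new in rows:
        pct = percent_fewer_cycles(old, new)
        print(f"  {length:>6} bytes: {pct:5.1f}% fewer cycles")


if __name__ == "__main__":
    report("Linear sk_buffs", LINEAR)
    report("Nonlinear sk_buffs", NONLINEAR)
```

In both tables the reduction shrinks as the length grows, which is the point
about short packets: the fixed per-call overhead that was removed matters most
when there are few bytes to checksum.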
- Eric