[BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s

Willy Tarreau w at 1wt.eu
Wed Nov 20 12:38:28 EST 2013


On Wed, Nov 20, 2013 at 09:30:07AM -0800, Eric Dumazet wrote:
> Well, all TCP performance results are highly dependent on the workload,
> and both receivers and senders behavior.
> 
> We made many improvements like TSO auto-sizing, DRS (Dynamic Right
> Sizing), and if the application used some specific settings (like
> SO_SNDBUF / SO_RCVBUF or other tweaks), we cannot guarantee that the
> same exact performance is reached from kernel version X to kernel
> version Y.

Of course, which is why I only care when there's a significant
difference. If I need 6 streams in a version and 8 in another one to
fill the wire, I call them identical. It's only when we dig into the
details that we analyse the differences.
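
As a side note for the archives, the per-socket settings you mention
are the classic setsockopt() pair. A minimal illustrative sketch of
mine (the sizes are made up, not a recommendation), which also shows
why such applications can react differently across kernel versions:

    /* Illustrative only: fixing SO_SNDBUF/SO_RCVBUF by hand locks the
     * buffer sizes and disables the kernel's auto-tuning (including
     * DRS on the receive side).
     */
    #include <sys/socket.h>

    int make_tuned_socket(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int snd = 256 * 1024;   /* made-up value */
        int rcv = 256 * 1024;   /* made-up value */

        if (fd < 0)
            return -1;
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, sizeof(snd));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, sizeof(rcv));
        return fd;
    }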

> We try to make forward progress; there is little gain in reverting all
> this great work. Linux had a tendency to favor throughput by using
> overly large skbs. It's time to do better.

I agree. Unfortunately our mails have crossed each other, so just to
keep this thread mostly linear, your next patch here:

   http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee

fixes that regression, and performance is back to normal, which is
good.

> As explained, some drivers are buggy, and need fixes.

Agreed!

> If nobody wants to fix them, this really means no one is interested
> in getting them fixed.

I was actually reading the code when I found, in your patch above,
exactly the window I was looking for :-)

> I am willing to help if you provide details, because otherwise I need
> a crystal ball ;)
> 
> One known problem of TCP is the fact that an incoming ACK making room
> in the socket write queue immediately wakes up a blocked thread
> (POLLOUT), even if only one MSS was acked and the write queue still
> has 2MB of outstanding bytes.

Indeed.
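
To make that wakeup pattern concrete, here is a toy sender loop I
sketched (nothing real; 'fd' is assumed to be a connected non-blocking
TCP socket). Logging bytes-per-wakeup makes the small writes visible:

    /* Toy demo of the pattern described above: poll() may report
     * POLLOUT as soon as a little room opens in the write queue, so
     * the writer can be woken only to push a handful of bytes.
     */
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    static void sender_loop(int fd, const char *buf, size_t len)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };

        while (len > 0) {
            if (poll(&pfd, 1, -1) <= 0)
                break;
            ssize_t n = write(fd, buf, len); /* may be tiny per wakeup */
            if (n <= 0)
                break;
            fprintf(stderr, "woken, wrote %zd bytes\n", n);
            buf += n;
            len -= n;
        }
    }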

> All these scheduling problems should be identified and fixed, and yes,
> this will require a dozen more patches.
> 
> max(128KB, 1-2ms) of buffering per flow should be enough to reach
> line rate, even for a single flow, but this means the sk_sndbuf value
> for the socket must take into account the pipe size _plus_ 1ms of
> buffering.

Which is the purpose of your patch above, and I confirm it fixes the
problem.
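
Running your numbers for GbE with a throwaway helper (mine, purely
illustrative): 1ms at line rate is ~125KB, so on a low-RTT LAN the
buffering term dominates the pipe size:

    /* Back-of-envelope sk_sndbuf sizing per the rule above. */
    #include <stdio.h>

    static unsigned long sndbuf_bytes(double gbit_per_s, double rtt_ms,
                                      double extra_ms)
    {
        double bytes_per_ms = gbit_per_s * 1e9 / 8.0 / 1000.0;
        double pipe  = bytes_per_ms * rtt_ms;    /* bytes in flight */
        double slack = bytes_per_ms * extra_ms;  /* ~1-2 ms buffering */
        return (unsigned long)(pipe + slack);
    }

    int main(void)
    {
        /* GbE LAN, 0.2 ms RTT, 1 ms extra buffering: ~150 KB */
        printf("%lu bytes\n", sndbuf_bytes(1.0, 0.2, 1.0));
        return 0;
    }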

Now looking at how to work around this lack of Tx IRQ.

Thanks!
Willy



