[LEDE-DEV] Transmit timeouts with mtk_eth_soc and MT7621

Fri Aug 25 07:25:41 PDT 2017

Hi all,

On Sun, Aug 20, 2017 at 12:30 AM, John Crispin <john at phrozen.org> wrote:
> correct, in my testing i have been ... with 200 parallel flows ... on
> MT7623, we'll have to find out what mt7621 can achieve ... this is all using
> hwnat ...
> 1) tcp - at 50 byte frames i am able to pass 720 MBit which is > 1M FPS
> 2) udp - at 128 byte frames i am able to pass ~450k FPS at ~10% packet loss
> .. at near wirespeed

I have spent the last two days looking into this. My testing was based
on LEDE master as of yesterday morning and my initial test setup was
the following:

Server (Intel NUC) <-> Gbit Switch <-> ZBT 2926 <-> Client

The switch was tested and confirmed working at gigabit speeds. I used
iperf for my tests, with a payload of 100 B, and configured port
forwarding of UDP port 1203 from the ZBT to the client. I then ran the
following command on the NUC in a loop:

iperf -u -c 10.1.2.63 -t 3600 -d -p 1203 -l 100B -b 1000M
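
The loop itself is not shown above. A minimal sketch of one way to
rerun the command back to back is given here; the run_in_loop helper
and its logging are hypothetical and not the exact script used in the
test.

```shell
#!/bin/sh
# Hypothetical helper (not from the original test): rerun a command
# back to back and log when each run starts.
# Usage: run_in_loop RUNS COMMAND [ARGS...]   (RUNS=0 loops until interrupted)
run_in_loop() {
    runs="$1"; shift
    i=0
    while [ "$runs" -eq 0 ] || [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        echo "run $i started at $(date)" >&2
        # Keep looping even if one run fails, but report its exit status.
        "$@" || echo "run $i exited with status $?" >&2
    done
}

# Example invocation with the command from this mail (left commented
# out so that sourcing this file does not start a test):
# run_in_loop 0 iperf -u -c 10.1.2.63 -t 3600 -d -p 1203 -l 100B -b 1000M
```

Logging the start time of each run makes it easier to line up a
transmit timeout in the router's log with the run that was active at
the time.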

I left the test running overnight (around 16 hours of pushing data),
but no error had been triggered as of this morning. Using bwm-ng, I
saw that the NUC was only able to push around 40 Mbit/s, which seemed
a bit low compared to earlier tests where I used the NUC as a traffic
generator. I don't know if it is relevant, but when capturing traffic
(on both the NUC and the client) I saw pause frames quite frequently.

Since this test did not trigger the error, and throughput was low, I
looked at some of the setups where I have seen this error. In all of
them, there is always something placed in front of the 2926 (a
router, switch, ...). I therefore modified my test setup as follows:

Server (Intel NUC) <-> Gbit Switch <-> ZBT 2926 #1 <-> ZBT 2926 #2 <-> Client

I forwarded port 1203 on the new ZBT router and repeated the
experiment. With this setup, the NUC pushes about 260 Mbit/s and I am
reliably able to trigger the error within ~1000 seconds. The error is
always seen on ZBT #1, and sometimes on ZBT #2. If I see the error on
#2, it is always at a later time than on #1, so it seems that the two
routers somehow affect each other. When looking at the RX bandwidth on
the client (using bwm-ng), I see that it is very bursty. I receive
data at about 32 Mbit/s, then no data for a while, then back to around
32 Mbit/s, and so on, until the error is triggered and the switch (TX)
on the router(s) dies. Pause frames are also seen on both server and
client in this experiment.
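
To compare these rates with the frames-per-second figures quoted at
the top, the bandwidth numbers can be converted to approximate packet
rates. The sketch below assumes the interface counters see 142 bytes
per packet (100 B payload + 8 B UDP + 20 B IPv4 + 14 B Ethernet
header). That accounting is an assumption and was not measured, so
the results are only rough estimates.

```shell
#!/bin/sh
# Assumed bytes per packet as seen by the interface counters:
# 100 B payload + 8 B UDP + 20 B IPv4 + 14 B Ethernet header.
# This is an assumption, not a value measured in the test.
FRAME_BYTES=$((100 + 8 + 20 + 14))

# mbit_to_pps MBIT [FRAME_BYTES]
# Prints the packet rate (integer, rounded down) for a given Mbit/s rate.
mbit_to_pps() {
    echo $(( $1 * 1000000 / (${2:-$FRAME_BYTES} * 8) ))
}

# Rates reported in this mail: first setup (NUC), second setup (NUC),
# and the bursts seen on the client in the second setup.
for rate in 40 260 32; do
    echo "$rate Mbit/s ~ $(mbit_to_pps "$rate") packets/s"
done
```

Under that assumption, 40 Mbit/s is roughly 35k packets/s, 260 Mbit/s
roughly 229k packets/s, and 32 Mbit/s roughly 28k packets/s.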

After having found a way to reliably trigger the issue, I tested the
patch provided by Mingyu. With this patch, the error is triggered much
faster, usually after around 300 seconds.

Mingyu, do you have any other ideas on what could be wrong or how to fix this?

John, would it be possible to get access to your staged commit, so
that I can repeat the test using your new code?

Thanks for all the help,
Kristian