[LEDE-DEV] Transmit timeouts with mtk_eth_soc and MT7621

Fri Aug 25 22:43:36 PDT 2017

Hi.

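I checked the code again and found that xmit_more can cause a TX timeout.
For reference, see:
https://www.mail-archive.com/netdev@vger.kernel.org/msg123334.html
As far as I can tell, if the ring fills up while skb->xmit_more is set,
the deferred write to FE_REG_TX_CTX_IDX0 may never happen, so the hardware
is not told about the descriptors that are already queued.
So the patch should look like this, in mtk_eth_soc.c: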

        tx_num = fe_cal_txd_req(skb);
        if (unlikely(fe_empty_txd(ring) <= tx_num)) {
+                if (skb->xmit_more)
+                        fe_reg_w32(ring->tx_next_idx, FE_REG_TX_CTX_IDX0);
                netif_stop_queue(dev);
                netif_err(priv, tx_queued, dev,
                          "Tx Ring full when queue awake!\n");

But I am not sure. Maybe the pause frames cause the problem.

2017-08-25 22:25 GMT+08:00 Kristian Evensen <kristian.evensen at gmail.com>:
> Hi all,
>
> On Sun, Aug 20, 2017 at 12:30 AM, John Crispin <john at phrozen.org> wrote:
>> correct, in my testing i have been ... with 200 parallel flows ... on
>> MT7623, we'll have to find out what mt7621 can achieve ... this is all using
>> hwnat ...
>> 1) tcp - at 50 byte frames i am able to pass 720 MBit which is > 1M FPS
>> 2) udp - at 128 byte frames i am able to pass ~450k FPS at ~10% packet loss
>> .. at near wirespeed
>
> I have spent the last two days looking into this. My testing was based
> on LEDE master as of yesterday morning and my initial test setup was
> the following:
>
> Server (Intel NUC) <-> Gbit Switch <-> ZBT 2926 <-> Client
>
> The switch was tested and confirmed working at gigabit speeds. I used
> iperf for my tests, with a payload of 100B and configured port
> forwarding of UDP port 1203 from ZBT to client. I then ran the
> following command on the NUC in a loop:
>
> iperf -u -c 10.1.2.63 -t 3600 -d -p 1203 -l 100B -b 1000M
>
> I left the test running overnight (around 16 hours of pushing data),
> but no error had been triggered as of this morning. Using bwm-ng, I
> saw that the NUC was able to push around 40 Mbit/s, which, based on
> earlier tests I have done where I have used the NUC as traffic
> generator, seemed a bit low. I don't know if it is relevant, but when
> capturing traffic (on both NUC and client) I saw pause packets quite
> frequently.
>
> Since this test did not yield any result, and throughput was low, I
> looked at some of the setups where I have seen this error. In all
> setups, there is always something placed in front of the 2926 (a
> router, switch, ...). I therefore modified my test setup to be as
> follows:
>
> Server (Intel NUC) <-> Gbit Switch <-> ZBT 2926 #1 <-> ZBT 2926 #2 <-> Client
>
> I forwarded port 1203 on the new ZBT router and repeated the
> experiment. Using this setup, the NUC pushed about 260 Mbit/s and I am
> reliably able to trigger the error within ~1000 seconds. The error is
> always seen on ZBT #1, and sometimes on ZBT #2. If I see the error on
> #2 it is always at a later time than #1, so it seems that the two
> routers somehow affect each other. When looking at the RX bandwidth on
> the client (using bwm-ng), I see that it is very bursty. I receive
> data at about 32 Mbit/s, then no data for a while, then back to around
> 32 Mbit/s, and so on, until the error is triggered and the switch (TX) on
> the router(s) dies. Pause frames are also seen on both server and
> client in this experiment.
>
> After having found a way to reliably trigger the issue, I tested the
> patch provided by Mingyu. With this patch, the error is triggered much
> faster, usually after around 300 seconds.
>
> Mingyu, do you have any other ideas on what could be wrong or how to fix this?
>
> John, would it be possible to get access to your staged commit, so
> that I can repeat the test using your new code?
>
> Thanks for all the help,
> Kristian