[LEDE-DEV] Transmit timeouts with mtk_eth_soc and MT7621

Kristian Evensen kristian.evensen at gmail.com
Wed Nov 8 13:56:40 PST 2017


Hello,

It turns out that the assumption that the "transmit timed out" issue
was related to pause frames/flow control was incorrect. I have
recently started to see the error again, with flow control disabled.
However, unlike last time, I am now able to reliably trigger the
issue.

The timeout seems to be triggered by connectivity problems between
MT7621-based routers (not sure if it applies to other devices with the
MT7530 switch) and the next hop. I checked each client connected to
some of the routers exhibiting this issue, and it turns out that some
had bad cables, etc. In order to check the theory in a more controlled
fashion, I set up the following small testbed:

NUC (192.168.1.1) <-> (192.168.1.2) ZBT 3526 (192.168.2.1) <->
(192.168.2.2) ZBT 2626 (192.168.3.1) <-> (192.168.3.2) Client

I then configured port forwarding from the 3526 and all the way to the
client and hammered the client with small UDP packets. Then, at random
points, I intentionally hung the kernel on the 2626 by triggering an
RCU stall. L2 stayed up, but the 2626 did not reply to any packets,
including ARP (so the neighbor-table entry for 192.168.2.2 was quickly
lost). More or less as soon as the kernel hung, the transmit timeout
error message started showing up. If I restart
networking or enable/disable the ports, then everything works fine for
a bit (I can for example ping 192.168.1.1 from 192.168.1.2), but after
some time the error appears again.
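
For anyone who wants to reproduce the load, here is a minimal sketch of
a small-packet UDP sender of the kind described above. It is not the
tool used in the test; the function name, the 18-byte payload size and
the port in the usage note are assumptions made for illustration.

```python
import socket
import time


def send_small_udp(host, port, count, payload_size=18, interval=0.0):
    """Send `count` UDP datagrams of `payload_size` bytes to (host, port).

    Returns the number of datagrams that were handed to the kernel.
    The payload size and pacing are illustrative assumptions.
    """
    payload = b"\x00" * payload_size
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            try:
                sock.sendto(payload, (host, port))
                sent += 1
            except OSError:
                # Send errors (e.g. full buffers, unreachable next hop)
                # are skipped so that the load keeps going.
                pass
            if interval:
                time.sleep(interval)
    return sent
```

Run from the NUC, a call such as send_small_udp("192.168.1.2", 5001,
1000000) would target the forwarded port on the 3526 (port 5001 is a
made-up example).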

I have been trying to solve this myself for a couple of days, but I am
starting to run out of ideas. Could it be that there is some traffic
destined for the client (via the 2626) that gets stuck in the TX
queue on the 3526? Any help, pointers on where to look or ideas for
what could be wrong would be much appreciated.
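
One way to see which TX queue the timeouts hit is to read the
per-queue timeout counters from sysfs. The sketch below assumes the
kernel exposes /sys/class/net/<iface>/queues/tx-<n>/tx_timeout and that
Python is available on the device; the function name is hypothetical
and the interface name is whatever the WAN/LAN device is called on the
router in question.

```python
from pathlib import Path


def read_tx_timeouts(iface, sysfs_root="/sys/class/net"):
    """Return {queue_name: timeout_count} for the TX queues of `iface`.

    Queues whose counter file is missing or unreadable are skipped.
    """
    counts = {}
    queues_dir = Path(sysfs_root) / iface / "queues"
    for queue in sorted(queues_dir.glob("tx-*")):
        try:
            counts[queue.name] = int((queue / "tx_timeout").read_text().strip())
        except (OSError, ValueError):
            # Counter not exposed for this queue, or not a number.
            continue
    return counts
```

Calling it before and after the 2626 is hung, and comparing the two
results, should show whether the counter increases on a single queue or
on all of them.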

Thanks in advance for the help,
Kristian


