[LEDE-DEV] Transmit timeouts with mtk_eth_soc and MT7621

Thu Nov 9 11:42:35 PST 2017

On Thu, Nov 9, 2017 at 5:21 PM, Kristian Evensen
<kristian.evensen at gmail.com> wrote:
> I have been hammering away on this issue during the day, and it seems
> that the DMA engine, TX, etc. works just fine. However, for some
> reason, the port with the router that has hung is able to stop the
> whole switch. If I disable the port (or disconnect the cable), then TX
> works again and I can for example reach 192.168.1.1 from 192.168.1.2
> in my testbed. When running ping (from 192.168.1.2 to 192.168.1.1)
> while disconnecting the cable, the first packets had a very high RTT
> (~20ms). Running tcpdump showed that the reply arrived immediately, so
> it seems the packets are stuck in a TX buffer for a really long time.
> Could it be that there is a cache or something internally on the
> switch that is causing packets to be held back, and that this cache is
> invalidated and buffers flushed when I disable the port? I cleared
> switch, DIP and SIP tables without any effect.
>
> If I enable the port, then the problem appears again after a little
> while (~30 seconds).

I replaced the 3526 with other devices containing the mt7530 switch
(both mt7621 and mt7623-based boards), and the issue seems to be
related to the switch rather than the SoC. I am able to reliably
trigger the timeout on all devices I have tested, running both
proprietary drivers/firmware and LEDE. I guess this suggests that
some traffic pattern or network behavior triggers an error in the
MT7530 and causes TX to freeze. Restarting the ports makes the switch
work again, but as long as the "bad" device is connected to the
mt7530, it is just a matter of time before the timeout is back.
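
For anyone who wants to reproduce this, the ping test quoted above can
be turned into a rough stall detector. The sketch below is only an
illustration, not something I ran on these boards: it assumes ping
prints reply lines containing "seq=N" and "time=X ms" (as iputils and
busybox ping normally do), and the function names and the 10 ms
threshold are made up for the example.

```python
import re

# Assumed reply-line shape: "... seq=12 ttl=64 time=20.3 ms".
# Matches both "icmp_seq=" (iputils) and "seq=" (busybox).
REPLY_RE = re.compile(r"seq=(\d+).*?time=([\d.]+) ms")


def parse_rtts(lines):
    """Return (sequence number, RTT in ms) for every ping reply line."""
    replies = []
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            replies.append((int(match.group(1)), float(match.group(2))))
    return replies


def slow_replies(replies, threshold_ms=10.0):
    """Return replies whose RTT exceeds the (hypothetical) threshold."""
    return [(seq, rtt) for seq, rtt in replies if rtt > threshold_ms]


def missing_seqs(replies):
    """Return sequence numbers between the first and last reply that never arrived."""
    seen = {seq for seq, _ in replies}
    if not seen:
        return []
    return [seq for seq in range(min(seen), max(seen) + 1) if seq not in seen]
```

Feeding it the saved output of "ping 192.168.1.1" from 192.168.1.2
gives the sequence numbers where replies were delayed or lost, which
can then be lined up with the time the "bad" device was plugged in.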

-Kristian