[LEDE-DEV] Transmit timeouts with mtk_eth_soc and MT7621

Kristian Evensen kristian.evensen at gmail.com
Sat Aug 26 07:47:50 PDT 2017


Hello again,

On Sat, Aug 26, 2017 at 12:38 PM, Kristian Evensen
<kristian.evensen at gmail.com> wrote:
> Hi,
>
> On Sat, Aug 26, 2017 at 7:43 AM, Mingyu Li <igvtee at gmail.com> wrote:
>> Hi.
>>
>> i check the code again. found xmit_more can cause tx timeout. you can
>> reference this.
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg123334.html
>> so the patch should be like this. edit mtk_eth_soc.c
>>
>>         tx_num = fe_cal_txd_req(skb);
>>         if (unlikely(fe_empty_txd(ring) <= tx_num)) {
>> +                if (skb->xmit_more)
>> +                        fe_reg_w32(ring->tx_next_idx, FE_REG_TX_CTX_IDX0);
>>                 netif_stop_queue(dev);
>>                 netif_err(priv, tx_queued, dev,
>>                           "Tx Ring full when queue awake!\n");
>>
>> but i am not sure. maybe the pause frame cause the problem.
>
> Thanks for the patch. I tested it, but I unfortunately still see the
> error. I also added a print-statement inside the conditional and can
> see that the condition is never hit. I also don't see the "Tx Ring
> full"-message. One difference which I noticed now though, is that I
> don't see the bursty bandwidth pattern I described earlier (32, 0, 32,
> 0, ...). With your patch, it is always 32, 0, crash.
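
For readers following the thread, the idea behind the quoted patch can
be modelled in user space. This is only a sketch under assumptions: the
struct, the counters and the flush_on_full switch are hypothetical and
are not taken from mtk_eth_soc.c. It shows how descriptors queued with
xmit_more set can stay invisible to the hardware if the queue is
stopped before the deferred doorbell write happens. Note that in the
test described above this branch was never reached, so it does not
explain the timeouts seen here.

```c
/* User-space model only; all names here are hypothetical. */
#include <stdbool.h>

struct fake_ring {
    int queued;      /* descriptors filled by software so far */
    int kicked;      /* descriptors the hardware has been told about */
    int free_slots;  /* descriptors still available */
    bool stopped;    /* stands in for netif_stop_queue() */
};

/* Descriptors that are filled but not yet announced to the hardware. */
int ring_unkicked(const struct fake_ring *r)
{
    return r->queued - r->kicked;
}

/*
 * Try to queue one frame. xmit_more means the stack promises another
 * frame soon, so the doorbell write is deferred. flush_on_full selects
 * the behaviour proposed in the quoted patch: write the doorbell
 * before stopping the queue when the ring is full.
 */
bool ring_xmit(struct fake_ring *r, bool xmit_more, bool flush_on_full)
{
    if (r->free_slots == 0) {
        if (xmit_more && flush_on_full)
            r->kicked = r->queued;   /* flush deferred descriptors */
        r->stopped = true;
        return false;                /* frame was not queued */
    }

    r->queued++;
    r->free_slots--;

    if (!xmit_more)
        r->kicked = r->queued;       /* normal doorbell write */

    return true;
}
```

Without the flush, a ring that fills up in the middle of an xmit_more
batch ends with a stopped queue and ring_unkicked() greater than zero,
so no TX completion would ever arrive to wake the queue again.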

I spent some more time looking into this today and think I might have
been able to solve the issue. My test has been running for ~2 hours
now without any errors (previously it would run for 10-15 minutes at
best), and I do not see any performance regressions. Before going
into detail, I should probably point out that I am not very familiar
with driver development, so my observations and changes might not be
appropriate/correct :)

I guess our common theory is that something is causing TX interrupts
not to be enabled again. After reading up on NAPI
(https://wiki.linuxfoundation.org/networking/napi), it seems that the
recommended way of using NAPI on clear-on-write devices (like the
RT5350) is to use NAPI for RX and do TX in the interrupt handler. In
the current driver, both TX and RX are handled in the NAPI callback
fe_poll(). I have modified the driver to split RX and TX, so now
fe_poll() only deals with RX, and TX is dealt with in fe_handle_irq().
I have attached the (messy) patch I am currently testing. If this is
an OK solution, I will clean up the patch and submit it to the list. I
will leave my tests running overnight and report back if anything pops
up.
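
To make the split concrete, here is a minimal user-space sketch of the
control flow, not the attached patch and not real driver code. All
names (fake_dev, INT_TX_DONE, handle_irq, poll_rx, ...) are
hypothetical stand-ins; poll_scheduled plays the role of
napi_schedule() and queue_stopped the role of netif_stop_queue(). The
point it illustrates is that TX completion is handled directly in the
interrupt handler and its interrupt is never masked, while only the RX
interrupt is masked during polling and re-enabled when the poller
drains the ring.

```c
/* User-space model only; none of this is the real mtk_eth_soc code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INT_TX_DONE (1u << 0)   /* hypothetical status bits */
#define INT_RX_DONE (1u << 1)

struct fake_dev {
    uint32_t int_status;   /* pending interrupt causes */
    uint32_t int_enable;   /* causes allowed to raise an IRQ */
    int tx_in_flight;      /* descriptors handed to the "hardware" */
    int tx_done;           /* descriptors the hardware has completed */
    int rx_pending;        /* received frames waiting for the poller */
    int rx_delivered;      /* frames passed up the stack */
    bool poll_scheduled;   /* stands in for napi_schedule() */
    bool queue_stopped;    /* stands in for netif_stop_queue() */
};

/* Reclaim completed TX descriptors; wake the queue if any were freed. */
int tx_reclaim(struct fake_dev *d)
{
    int freed = d->tx_done;

    assert(freed <= d->tx_in_flight);
    d->tx_in_flight -= freed;
    d->tx_done = 0;
    if (freed > 0)
        d->queue_stopped = false;
    return freed;
}

/* IRQ handler: TX is finished here, RX is deferred to the poller. */
void handle_irq(struct fake_dev *d)
{
    uint32_t status = d->int_status & d->int_enable;

    d->int_status &= ~status;          /* acknowledge what we handle */

    if (status & INT_TX_DONE)
        tx_reclaim(d);                 /* TX interrupt stays enabled */

    if (status & INT_RX_DONE) {
        d->int_enable &= ~INT_RX_DONE; /* mask RX until polling ends */
        d->poll_scheduled = true;
    }
}

/* Poll callback: RX only; re-enables the RX interrupt when drained. */
int poll_rx(struct fake_dev *d, int budget)
{
    int done = 0;

    while (done < budget && d->rx_pending > 0) {
        d->rx_pending--;
        d->rx_delivered++;
        done++;
    }
    if (done < budget) {               /* ring drained: leave poll mode */
        d->poll_scheduled = false;
        d->int_enable |= INT_RX_DONE;
    }
    return done;
}
```

In this model there is no code path that leaves the TX interrupt
disabled, which is the property the split is meant to provide.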

I guess that John's new driver is the future for mtk_eth_soc, but I
believe that fixing this issue for the current driver is worthwhile.
As things are now, it is possible to DDOS RT5350-based routers running
LEDE 17.01 just by sending the right type of traffic.

-Kristian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-FIX-Move-TX-out-of-Napi.patch
Type: text/x-patch
Size: 3485 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/lede-dev/attachments/20170826/703a49c0/attachment-0001.bin>