[LEDE-DEV] Transmit timeouts with mtk_eth_soc and MT7621

Sat Aug 19 15:30:53 PDT 2017

On 20/08/17 00:07, Kristian Evensen wrote:
>
> On Sat, 19 Aug 2017 at 23:52, John Crispin <john at phrozen.org> wrote:
>
>
>
>     On 19/08/17 23:13, Kristian Evensen wrote:
>     > Hi both,
>     >
>     > On Sat, 19 Aug 2017 at 20:16, John Crispin <john at phrozen.org> wrote:
>     >
>     >     Hi All,
>     >
>     >     i have a staged commit on my laptop that makes all the (upstream)
>     >     ethernet fixes that i pushed to mt7623 work on mt7621. please hang on
>     >     for a few more days till i finished testing the support. this will add
>     >     latest upstream ethernet support + DSA
>     >
>     >
>     > Thanks for the follow-up Mingyu and the info John. I have not had time
>     > to investigate the issue further (holiday backlog ...), but will start
>     > working on trying to reproduce it at the end of next week. I have
>     > deployed the patch to some routers and have not seen any regressions,
>     > but I would like to know how to reliably trigger the issue before
>     > concluding :)
>     >
>     > John, do your commits include a fix similar to what Mingyu sent me?
>
>
>     with my fixes the mt7623 passes a 48h stress test running the unit on an
>     iperf test with 200 parallel flows at full wire speed. once backported
>     to mt7621 i am pretty confident that the fix will yield the maximum
>     stable performance we can get.
>
>
> Thanks! I will focus on finding a way to reproduce the issue then, and 
> then test Mingyu and your patches. Out of curiosity, when you say 
> maximum stable performance, does that mean that the hwnat will also be 
> backported?
>
> Kristian
>

correct, in my testing i have been ... with 200 parallel flows ... on 
MT7623, we'll have to find out what mt7621 can achieve ... this is all 
using hwnat ...
1) tcp - at 50 byte frames i am able to pass 720 Mbit/s which is > 1M FPS
2) udp - at 128 byte frames i am able to pass ~450k FPS at ~10% packet 
loss .. at near wirespeed
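
As a quick sanity check on the TCP figure, the frame rate follows from dividing the bit rate by the bits per frame. The sketch below is only an illustration: the helper name is made up, and the message does not say whether "50 byte" includes headers or whether the 720 Mbit/s counts the 20 bytes of preamble and inter-frame gap that each Ethernet frame occupies on the wire, so both readings are shown.

```python
def frames_per_second(bit_rate, frame_bytes, overhead_bytes=0):
    """Frames per second for a given bit rate and frame size.

    overhead_bytes is an assumption of this sketch: pass 20 to count the
    8-byte preamble and 12-byte inter-frame gap as part of each frame.
    """
    return bit_rate / ((frame_bytes + overhead_bytes) * 8)


# Figures quoted for the TCP test: 720 Mbit/s at 50 byte frames.
plain = frames_per_second(720e6, 50)
on_wire = frames_per_second(720e6, 50, overhead_bytes=20)

print(f"ignoring wire overhead: {plain:,.0f} FPS")
print(f"counting wire overhead: {on_wire:,.0f} FPS")
```

Under either reading the result stays above one million frames per second, which is consistent with the "> 1M FPS" remark.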

in a nutshell ... UDP has no TC. due to this, the lower the frame size, 
the higher the packet loss. the HW NAT will assert the FC bit inside the 
GMAC. when using TCP this will cause back pressure to make the OS stall 
the connection and reduce max throughput. in contrast, when using UDP 
you'll see packet loss go up instead of dropping throughput as there is 
no TC.
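
The difference between the two cases can be pictured with a toy single-bottleneck model. This is only a sketch under stated assumptions: the function name and the rates are hypothetical, not measurements from the tests above, and real TCP behaviour is far more involved than "the sender slows down to the capacity".

```python
def toy_link(offered_fps, capacity_fps, obeys_back_pressure):
    """Return (delivered_fps, loss_fraction) for a toy bottleneck.

    A sender that obeys back pressure slows down to the capacity and loses
    nothing; one that ignores it keeps sending and the excess is dropped.
    """
    if obeys_back_pressure:
        sent = min(offered_fps, capacity_fps)
    else:
        sent = offered_fps
    delivered = min(sent, capacity_fps)
    loss = 0.0 if sent == 0 else (sent - delivered) / sent
    return delivered, loss


# Hypothetical rates, chosen only to make the effect visible.
tcp_like = toy_link(500_000, 450_000, obeys_back_pressure=True)
udp_like = toy_link(500_000, 450_000, obeys_back_pressure=False)

print("sender stalls:       delivered=%d loss=%.0f%%" % (tcp_like[0], tcp_like[1] * 100))
print("sender keeps going:  delivered=%d loss=%.0f%%" % (udp_like[0], udp_like[1] * 100))
```

In this model the stalled sender sees lower throughput and no loss, while the sender that ignores back pressure sees the same delivered rate but a non-zero loss figure.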

also i have managed to make HW QoS work, still working on the best way 
to integrate this with fw3. HW QoS does perform remarkably well on 
mt7623. when saturating the link doing lan->wan traffic i am able to ssh 
into the unit and only have a slight subjective increase in latency.

    John

>
>           John
>
>     >
>     > Kristian
>     >
>     >
>     >
>     >          John
>     >
>     >
>     >     On 19/08/17 17:06, Mingyu Li wrote:
>     >     > Hi Kristian.
>     >     >
>     >     > does this patch work?
>     >     >
>     >     > 2017-07-24 23:45 GMT+08:00 Mingyu Li <igvtee at gmail.com>:
>     >     >> i guess other interrupts may be causing the problem, because the
>     >     >> ethernet receive flow is interrupted by other hardware. so using an
>     >     >> sd card, wifi or usb to generate interrupts may trigger it.
>     >     >>
>     >     >> 2017-07-24 17:19 GMT+08:00 Kristian Evensen <kristian.evensen at gmail.com>:
>     >     >>> Hi,
>     >     >>>
>     >     >>> On Mon, Jul 24, 2017 at 4:02 AM, Mingyu Li <igvtee at gmail.com> wrote:
>     >     >>>> i guess the problem is that some tx data is not freed but the tx
>     >     >>>> interrupt is already cleared, which causes the tx timeout. the old
>     >     >>>> code frees the data first and then clears the interrupt, but new
>     >     >>>> data may arrive after the data is freed and before the interrupt
>     >     >>>> is cleared.
>     >     >>>> so change it to clear the interrupt first and then free all tx data
>     >     >>>> (also remove the budget limit). if new tx data arrives, the hardware
>     >     >>>> will set the tx interrupt flag again and we will free it next time.
>     >     >>>> i also applied this to the rx flow.
>     >     >>> Thanks for the detailed explanation. I have deployed an image with the
>     >     >>> patch to some of the routers showing this issue, so let's wait and see.
>     >     >>> Of course, all routers have been stable for the last couple of days
>     >     >>> (including before the weekend) now, so I will let them run for a week
>     >     >>> or so and then report back.
>     >     >>>
>     >     >>> In order to ease testing and make it more controlled, do you have any
>     >     >>> suggestions for how to trigger the error? Is it "just" a timing issue
>     >     >>> or should I be able to trigger it with for example a specific traffic
>     >     >>> pattern?
>     >     >>>
>     >     >>> -Kristian
>     >     > _______________________________________________
>     >     > Lede-dev mailing list
>     >     > Lede-dev at lists.infradead.org
>     >     > http://lists.infradead.org/mailman/listinfo/lede-dev
>     >
>
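
The ordering issue Mingyu describes in the quoted 2017-07-24 message can be pictured with a small toy model. This is a sketch only, not the mtk_eth_soc code: the class and function names are hypothetical, and the hardware is reduced to a counter of completed descriptors plus a single interrupt flag.

```python
class ToyTxRing:
    """Toy stand-in for a TX ring: completed-descriptor count plus one IRQ flag."""

    def __init__(self):
        self.completed = 0        # descriptors finished by hardware, not yet freed
        self.irq_pending = False  # TX-done interrupt status bit

    def hw_complete(self, n):
        # Hardware finishes n descriptors and raises the interrupt flag.
        self.completed += n
        self.irq_pending = True

    def free_all(self):
        # Driver frees every completed descriptor (no budget limit).
        self.completed = 0

    def ack_irq(self):
        # Driver clears the interrupt status bit.
        self.irq_pending = False

    def is_stuck(self):
        # Work is waiting, but nothing will wake the driver up again.
        return self.completed > 0 and not self.irq_pending


def free_then_ack(ring, late):
    """Old order: free first, then clear the interrupt."""
    ring.free_all()
    ring.hw_complete(late)  # completion lands between the two steps
    ring.ack_irq()


def ack_then_free(ring, late):
    """New order: clear the interrupt first, then free everything."""
    ring.ack_irq()
    ring.hw_complete(late)  # completion lands between the two steps
    ring.free_all()


old, new = ToyTxRing(), ToyTxRing()
for ring in (old, new):
    ring.hw_complete(3)  # initial batch that triggers the poll

free_then_ack(old, late=1)
ack_then_free(new, late=1)

print("free then ack stuck:", old.is_stuck())
print("ack then free stuck:", new.is_stuck())
```

In this toy, the old order leaves one completed descriptor behind with the flag already cleared, which matches the described path to a transmit timeout. The new order leaves the flag set, so the worst case is one extra poll that finds nothing to free.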