usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s

Doug Anderson dianders at chromium.org
Wed May 10 16:50:52 PDT 2017


Hi,

On Wed, May 10, 2017 at 9:31 AM, Johan Hovold <johan at kernel.org> wrote:
> On Mon, May 08, 2017 at 10:22:08PM +0200, Stefan Wahren wrote:
>>
>> > Stefan Wahren <stefan.wahren at i2se.com> wrote on 25 April 2017 at 20:11:
>> >
>> >
>> > > Stefan Wahren <stefan.wahren at i2se.com> wrote on 22 April 2017 at 22:50:
>> > >
>> > >
>> > > Hi,
>> > >
>> > > > Eric Anholt <eric at anholt.net> wrote on 20 April 2017 at 20:54:
>> > > >
>> > > >
>> > > > Stefan Wahren <stefan.wahren at i2se.com> writes:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > >> Doug Anderson <dianders at chromium.org> wrote on 18 April 2017 at 22:41:
>> > > > >>
>> > > > >>
>> > > > >> It's hard to know for sure that all of this time is really in
>> > > > >> urb_enqueue().  Possible we could have task switched out and been
>> > > > >> blocked elsewhere.  Using ftrace to get more fine-grained timings
>> > > > >> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>> > > > >> friends here if you want to use trace_printk.
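
As a concrete illustration of the instrumentation Doug describes above, a
wrapper along these lines could be dropped around the enqueue path.  This is
a minimal sketch only: the wrapped callee name is a hypothetical placeholder,
and the exact hook point in dwc2 will differ.

    #include <linux/kernel.h>
    #include <linux/ktime.h>
    #include <linux/usb.h>
    #include <linux/usb/hcd.h>

    /* Hypothetical placeholder for whatever function is actually measured. */
    static int dwc2_hcd_urb_enqueue_real(struct usb_hcd *hcd, struct urb *urb,
                                         gfp_t mem_flags);

    /* Minimal sketch: time one call with ktime and log it via trace_printk(). */
    static int timed_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
                                 gfp_t mem_flags)
    {
            ktime_t start = ktime_get();
            int ret;

            ret = dwc2_hcd_urb_enqueue_real(hcd, urb, mem_flags);

            /* The result lands in the ftrace ring buffer (trace / trace_pipe). */
            trace_printk("urb_enqueue: %lld us\n",
                         ktime_to_us(ktime_sub(ktime_get(), start)));

            return ret;
    }
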
>> > > > >
>> > > > > I'm a newbie to ftrace, so I hope this is helpful.
>> > > > >
>> > > > > # connect PL2303 to the onboard hub
>> > > > > # echo 0 > options/sleep-time
>> > > > > # echo 0 > function_profile_enabled
>> > > > > # echo 1 > function_profile_enabled
>> > > > > # ./usb_test
>> > > > > # Waiting for at least 20 seconds and then disconnect PL2303
>> > > > > # echo 0 > function_profile_enabled
>> > > > > # cat trace_stat/function0
>> > > > >
>> > > > >   Function                               Hit    Time            Avg             s^2
>> > > > >   --------                               ---    ----            ---             ---
>> > > > >   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>> > > > >   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>> > > > >   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>> > > > >   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>> > > > >   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>> > > >
>> > > > If I'm reading this output right, we're spending half of our interrupt
>> > > > processing time in irq_exit(), so even if dwc2's interrupt was free (the
>> > > > generic_handle_irq() chain), we'd be eating about half the CPU getting
>> > > > back out of the interrupt handler, right?
>> > > >
>> > > > I don't really know anything about DWC2 or USB, but is there any way we
>> > > > could mitigate the interrupt frequency with this hardware?  If nothing
>> > > > else, could we loop reading gintsts until it reads back 0?
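
For illustration, Eric's idea of looping on GINTSTS could look roughly like
the sketch below.  It is not actual dwc2 code: the helper that dispatches the
pending causes is hypothetical, a real version would also mask the status
against GINTMSK, and the register accessor differs between kernel versions.

    #include <linux/interrupt.h>
    #include <linux/io.h>
    #include "core.h"   /* dwc2's struct dwc2_hsotg; GINTSTS comes from hw.h */

    /* Hypothetical helper that services whatever causes are set in gintsts. */
    static void dwc2_dispatch_pending(struct dwc2_hsotg *hsotg, u32 gintsts);

    /* Rough sketch only: keep handling causes until GINTSTS reads back zero,
     * so one hardware IRQ can drain a burst of back-to-back events. */
    static irqreturn_t dwc2_irq_drain(int irq, void *dev)
    {
            struct dwc2_hsotg *hsotg = dev;
            int limit = 32; /* safety cap so a stuck status bit can't hog the CPU */
            u32 gintsts;

            while ((gintsts = readl(hsotg->regs + GINTSTS)) && --limit)
                    dwc2_dispatch_pending(hsotg, gintsts);

            return IRQ_HANDLED;
    }
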
>> > >
>> > > First of all, I updated my kernel to 4.11-rc7 and the issue still
>> > > occurs. Today I had some time to investigate this issue further
>> > > and I made some interesting observations:
>> > >
>> > > 1. The lockup doesn't always occur after starting usb_test. In
>> > > rare cases I was able to run the program without a lockup.
>> > > 2. When the lockup occurred, we were always able to "rescue" the
>> > > BCM2835 from this state by sending some serial data to the PL2303.
>> >
>> > Based on this scenario I patched the interrupt routine to detect both
>> > the interrupt storm and the normal condition, so I can dump the global
>> > and host registers in both situations (bad case and good case).
>> >
>> > Here is the diff between the two register dumps; maybe someone sees
>> > something interesting:
>> >
>> > --- badcase.txt     2017-04-25 18:02:59.000000000 +0000
>> > +++ goodcase.txt    2017-04-25 18:02:59.000000000 +0000
>> > @@ -4,12 +4,12 @@
>> >   dwc2_dump_global_registers: GAHBCFG        @0xCC850008 : 0x00000031
>> >   dwc2_dump_global_registers: GUSBCFG        @0xCC85000C : 0x20001700
>> >   dwc2_dump_global_registers: GRSTCTL        @0xCC850010 : 0x80000000
>> > - dwc2_dump_global_registers: GINTSTS        @0xCC850014 : 0x06000021
>> > + dwc2_dump_global_registers: GINTSTS        @0xCC850014 : 0x06000029
>> >   dwc2_dump_global_registers: GINTMSK        @0xCC850018 : 0xF300080E
>> > - dwc2_dump_global_registers: GRXSTSR        @0xCC85001C : 0x00070044
>> > + dwc2_dump_global_registers: GRXSTSR        @0xCC85001C : 0x000F0001
>> >   dwc2_dump_global_registers: GRXFSIZ        @0xCC850024 : 0x00000306
>> >   dwc2_dump_global_registers: GNPTXFSIZ      @0xCC850028 : 0x01000306
>> > - dwc2_dump_global_registers: GNPTXSTS       @0xCC85002C : 0x1D080100
>> > + dwc2_dump_global_registers: GNPTXSTS       @0xCC85002C : 0x18080100
>> > ...
>>
>> @Doug, @John Y., @Johan: Any further ideas to narrow down the reason
>> for this interrupt storm?
>
> No, sorry.
>
> I suggested earlier that you could try commenting out the interrupt-URB
> submission in a naive attempt to narrow it down since that is an obvious
> difference from the ftdi driver, but then again so are the frequent ftdi
> status bulk messages (and any other timing differences).
>
> Seems like you need to dig into dwc2.
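
For reference, the experiment Johan mentions would amount to something like
the sketch below in the pl2303 driver's open path.  The function body is
illustrative rather than a verbatim copy of drivers/usb/serial/pl2303.c.

    #include <linux/tty.h>
    #include <linux/usb.h>
    #include <linux/usb/serial.h>

    /* Illustrative only: an open() path with the interrupt-in URB submission
     * compiled out, to check whether the dwc2 interrupt storm still happens
     * when the interrupt endpoint is never polled. */
    static int pl2303_open_no_int(struct tty_struct *tty,
                                  struct usb_serial_port *port)
    {
    #if 0   /* disabled for the experiment */
            int result = usb_submit_urb(port->interrupt_in_urb, GFP_KERNEL);

            if (result)
                    return result;
    #endif
            return usb_serial_generic_open(tty, port);
    }
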

I haven't had a chance to dig in, and I'm not sure when I'll have that
time.  I'll keep this around in my inbox for now on the off chance that
I clean it out enough to get back to this.  Sorry!


