am335x: 5.18.x: system stalling

Thu May 12 01:14:15 PDT 2022

On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony at atomide.com> wrote:
> Adding Ard and Arnd for vmap stack.

Thanks!

> * Yegor Yefremov <yegorslists at googlemail.com> [220511 14:16]:
> > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony at atomide.com> wrote:
> > > * Yegor Yefremov <yegorslists at googlemail.com> [220504 10:35]:

>
> Maybe Ard and Arnd have some ideas what might be going wrong here.
> Basically anything trying to use a physical address on stack will
> fail in weird ways like we've seen for smc and wl1251.

For this, the first step should be to enable CONFIG_DMA_API_DEBUG.
If any device is getting the wrong DMA address for a stack variable,
this should print a helpful debug message to the console.
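
As a rough sketch of how to look for those messages afterwards: dma-debug
reports carry a "DMA-API:" prefix, so a captured console log can be filtered
for it. The helper name below is made up for illustration, and the exact
wording of each report differs between kernel versions.

```python
def dma_api_reports(log_lines):
    """Return the log lines that carry a dma-debug ("DMA-API:") report."""
    # Works on any iterable of lines, e.g. an open file holding the
    # serial console capture: dma_api_reports(open("console.log"))
    return [line.rstrip("\n") for line in log_lines if "DMA-API:" in line]
```

An empty result only means nothing was reported in that capture; it does not
rule out a problem in a code path that was never exercised.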

> > > > [   88.408578] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> > > > [   88.415777]  (detected by 0, t=2602 jiffies, g=2529, q=17)
> > > > [   88.422026] rcu: All QSes seen, last rcu_sched kthread activity 2602 (-21160--23762), jiffies_till_next_fqs=1, root ->qsmask 0x0
> > > > [   88.434445] rcu: rcu_sched kthread starved for 2602 jiffies! g2529 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> > > > [   88.445274] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> > > > [   88.454859] rcu: RCU grace-period kthread stack dump:

I looked for a smoking gun in the backtrace but didn't really find anything,
so I'm guessing the problem is something that happened between the
last timer tick and the time it actually ran the rcu_gp_kthread, maybe
some DMA timeout in a device driver running with interrupts disabled.

> > > > [   88.807588]  omap3_noncore_dpll_program from clk_change_rate+0x23c/0x4f8
> > > > [   88.815375]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
> > > > [   88.822936]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
> > > > [   88.830056]  clk_set_rate from _set_opp+0x254/0x51c
> > > > [   88.835835]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
> > > > [   88.842073]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x584/0x700
> > > > [   88.849792]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
> > > > [   88.856953]  od_dbs_update from dbs_work_handler+0x2c/0x60
> > > > [   88.863441]  dbs_work_handler from process_one_work+0x284/0x72c
> > > > [   88.870411]  process_one_work from worker_thread+0x28/0x4b0
> > > > [   88.876973]  worker_thread from kthread+0xe4/0x104
> > > > [   88.882692]  kthread from ret_from_fork+0x14/0x28

The only thing I see that is slightly unusual here is that the timer tick
happened exactly during the cpufreq transition. Is this always the same
backtrace when you run into the bug? What happens when you disable the
omap3 cpufreq driver or set it to run at a fixed frequency?
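
One way to try the fixed-frequency case without rebuilding the kernel is to
switch the governor through the standard cpufreq sysfs files. A minimal
sketch, assuming the policy is exposed under cpu0 and that the "performance"
governor is built in; the function name is only for illustration:

```python
from pathlib import Path

# Standard cpufreq sysfs location for CPU0; adjust if the policy lives elsewhere.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def set_governor(governor, cpufreq_dir=CPUFREQ_DIR):
    """Switch the cpufreq governor, refusing names the kernel does not offer."""
    cpufreq_dir = Path(cpufreq_dir)
    available = (cpufreq_dir / "scaling_available_governors").read_text().split()
    if governor not in available:
        raise ValueError(f"{governor!r} not in {available}")
    (cpufreq_dir / "scaling_governor").write_text(governor)
    return available
```

Running set_governor("performance") as root should take the ondemand
governor out of the picture, so the od_dbs_update path from the backtrace
would no longer be requesting rate changes.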

          Arnd