am335x: 5.18.x: system stalling

Tue Jun 7 01:55:30 PDT 2022

On Sun, Jun 5, 2022 at 4:59 PM Ard Biesheuvel <ardb at kernel.org> wrote:
>
> On Fri, 3 Jun 2022 at 22:47, Arnd Bergmann <arnd at arndb.de> wrote:
> >
> > On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov
> > <yegorslists at googlemail.com> wrote:
> > >
> > > With compiled-in drivers the system doesn't stall. All other tests and
> > > related outputs will come next week.
> >
> > Ah, nice!
> >
> > It's probably a reasonable assumption that the smp-patched get_current()
> > is (at least sometimes) broken in modules but working in the kernel itself.
> > I suppose that means in the worst case we can hot-fix the issue by
> > having an 'extern' version of get_current() for the case of
> > armv6+smp+module ;-)
> >
>
> I've coded something up along those lines, and pushed it to my
> am335x-stall-test branch.
>
> > Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
> > mail. If that gives you an oops for the module case, then we know
> > that the patching doesn't work at all and you don't have to try anything
> > else, otherwise it's more likely that an incorrect instruction sequence
> > is patched in.
> >
>
> Yeah, I'd be really surprised if the patching misses some occurrences,
> so I have no clue what is going on here.
>
> Yegor, can you please try my branch with the original config (i.e.,
> slcan and ftdi_sio as modules)?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test

@Arnd: I have applied your patch with this change:

asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap

But it revealed nothing new:

[   50.754130] rcu: INFO: rcu_sched self-detected stall on CPU
[   50.760834] rcu:     0-...!: (2600 ticks this GP) idle=ec9/1/0x40000004 softirq=1852/1852 fqs=0
[   50.770407]  (t=2600 jiffies g=2577 q=17)
[   50.775046] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   50.786961] rcu:     Possible timer handling issue on cpu=0 timer-softirq=872
[   50.794429] rcu: rcu_sched kthread starved for 2600 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[   50.805403] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   50.814927] rcu: RCU grace-period kthread stack dump:
[   50.820464] task:rcu_sched       state:I stack:    0 pid:   10 ppid:     2 flags:0x00000000
[   50.830019] [<c0b683d4>] (__schedule) from [<c0b68d18>] (schedule+0x54/0xe8)
[   50.838470] [<c0b68d18>] (schedule) from [<c0b6f51c>] (schedule_timeout+0xa8/0x210)
[   50.847208] [<c0b6f51c>] (schedule_timeout) from [<c01d85b4>] (rcu_gp_fqs_loop+0x118/0x6b4)
[   50.856631] [<c01d85b4>] (rcu_gp_fqs_loop) from [<c01dc4e4>] (rcu_gp_kthread+0x138/0x30c)
[   50.865832] [<c01dc4e4>] (rcu_gp_kthread) from [<c0164df8>] (kthread+0x13c/0x164)
[   50.874315] [<c0164df8>] (kthread) from [<c0100140>] (ret_from_fork+0x14/0x34)
[   50.882477] rcu: Stack dump where RCU GP kthread last ran:
[   50.888512] NMI backtrace for cpu 0
[   50.892575] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[   50.899912] Hardware name: Generic AM33XX (Flattened Device Tree)
[   50.906610] Workqueue: events dbs_work_handler
[   50.912202] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[   50.921035] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)
[   50.928943] NMI backtrace for cpu 0
[   50.933084] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[   50.940419] Hardware name: Generic AM33XX (Flattened Device Tree)
[   50.947083] Workqueue: events dbs_work_handler
[   50.952574] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[   50.961334] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)

@Ard: I have tried your branch
(21b6671c82d4df52ea0c7837705331acb375c5c8). The system still stalls.
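
For reference, here is a minimal sketch of how the trap word in the
asm() line above decodes, assuming the ARM-state (A32) UDF encoding
(cond=1110, bits[27:20]=0x7f, bits[7:4]=0xf, with a 16-bit immediate
split into imm12:imm4). The macro and helper names are hypothetical,
made up for illustration only, and are not kernel APIs. Under that
assumption 0xe7f001f2 is a permanently undefined instruction with
immediate 0x12, so a site that was executed without being patched
should have trapped rather than stalled silently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed A32 UDF layout: 1110 0111 1111 imm12 1111 imm4 (names are illustrative). */
#define A32_UDF_MASK  0xfff000f0u
#define A32_UDF_VALUE 0xe7f000f0u

/* True if the 32-bit word matches the fixed bits of an A32 UDF instruction. */
static bool a32_is_udf(uint32_t insn)
{
    return (insn & A32_UDF_MASK) == A32_UDF_VALUE;
}

/* Reassemble the 16-bit immediate from imm12 (bits[19:8]) and imm4 (bits[3:0]). */
static uint16_t a32_udf_imm16(uint32_t insn)
{
    uint32_t imm12 = (insn >> 8) & 0xfffu;
    uint32_t imm4 = insn & 0xfu;

    return (uint16_t)((imm12 << 4) | imm4);
}

/* Sanity check against the word used in the asm() hack. */
static void a32_udf_selfcheck(void)
{
    assert(a32_is_udf(0xe7f001f2u));
    assert(a32_udf_imm16(0xe7f001f2u) == 0x12);
}
```

Calling a32_udf_selfcheck() from a small userspace test program is
enough to confirm the decoding; nothing here touches the running kernel.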

Yegor