am335x: 5.18.x: system stalling
Yegor Yefremov
yegorslists at googlemail.com
Tue Jun 7 01:55:30 PDT 2022
On Sun, Jun 5, 2022 at 4:59 PM Ard Biesheuvel <ardb at kernel.org> wrote:
>
> On Fri, 3 Jun 2022 at 22:47, Arnd Bergmann <arnd at arndb.de> wrote:
> >
> > On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov
> > <yegorslists at googlemail.com> wrote:
> > >
> > > With compiled-in drivers the system doesn't stall. All other tests and
> > > related outputs will come next week.
> >
> > Ah, nice!
> >
> > It's probably a reasonable assumption that the smp-patched get_current()
> > is (at least sometimes) broken in modules but working in the kernel itself.
> > I suppose that means in the worst case we can hot-fix the issue by
> > having an 'extern' version of get_current() for the case of
> > armv6+smp+module ;-)
> >
>
> I've coded something up along those lines, and pushed it to my
> am335x-stall-test branch.
>
> > Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
> > mail. If that gives you an oops for the module case, then we know
> > that the patching doesn't work at all and you don't have to try anything
> > else, otherwise it's more likely that an incorrect instruction sequence
> > is patched in.
> >
>
> Yeah, I'd be really surprised if the patching misses some occurrences,
> so I have no clue what is going on here.
>
> Yegor, can you please try my branch with the original config (i.e.,
> slcan and ftdio as modules)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
@Arnd: I have applied your patch with this change:
asm("0: .long 0xe7f001f2 \n\t" // BUG() trap
But it revealed nothing new:
[ 50.754130] rcu: INFO: rcu_sched self-detected stall on CPU
[ 50.760834] rcu: 0-...!: (2600 ticks this GP) idle=ec9/1/0x40000004 softirq=1852/1852 fqs=0
[ 50.770407] (t=2600 jiffies g=2577 q=17)
[ 50.775046] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 50.786961] rcu: Possible timer handling issue on cpu=0 timer-softirq=872
[ 50.794429] rcu: rcu_sched kthread starved for 2600 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 50.805403] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 50.814927] rcu: RCU grace-period kthread stack dump:
[ 50.820464] task:rcu_sched state:I stack: 0 pid: 10 ppid: 2 flags:0x00000000
[ 50.830019] [<c0b683d4>] (__schedule) from [<c0b68d18>] (schedule+0x54/0xe8)
[ 50.838470] [<c0b68d18>] (schedule) from [<c0b6f51c>] (schedule_timeout+0xa8/0x210)
[ 50.847208] [<c0b6f51c>] (schedule_timeout) from [<c01d85b4>] (rcu_gp_fqs_loop+0x118/0x6b4)
[ 50.856631] [<c01d85b4>] (rcu_gp_fqs_loop) from [<c01dc4e4>] (rcu_gp_kthread+0x138/0x30c)
[ 50.865832] [<c01dc4e4>] (rcu_gp_kthread) from [<c0164df8>] (kthread+0x13c/0x164)
[ 50.874315] [<c0164df8>] (kthread) from [<c0100140>] (ret_from_fork+0x14/0x34)
[ 50.882477] rcu: Stack dump where RCU GP kthread last ran:
[ 50.888512] NMI backtrace for cpu 0
[ 50.892575] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[ 50.899912] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 50.906610] Workqueue: events dbs_work_handler
[ 50.912202] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[ 50.921035] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)
[ 50.928943] NMI backtrace for cpu 0
[ 50.933084] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[ 50.940419] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 50.947083] Workqueue: events dbs_work_handler
[ 50.952574] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[ 50.961334] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)
@Ard: I have tried your branch
(21b6671c82d4df52ea0c7837705331acb375c5c8). The system still stalls.
Yegor
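
Note on the test word: the ".long 0xe7f001f2" used above as a BUG() trap
matches the A32 permanently-undefined (UDF) encoding, so executing it should
raise an undefined-instruction exception if the unpatched sequence is ever
reached. The sketch below only decodes the word to show that. The helper
names (a32_is_udf, a32_udf_imm16) are hypothetical and not taken from the
kernel; this is an illustration under those assumptions, not the code from
either patch in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A32 UDF encoding: cond = 1110, bits[27:20] = 0111 1111,
 * imm12 in bits[19:8], bits[7:4] = 1111, imm4 in bits[3:0].
 * The mask keeps only the fixed bits; the pattern is their required value.
 */
#define A32_UDF_MASK    0xfff000f0u
#define A32_UDF_PATTERN 0xe7f000f0u

/* Hypothetical helper: true if the word is an A32 UDF instruction. */
static bool a32_is_udf(uint32_t insn)
{
    return (insn & A32_UDF_MASK) == A32_UDF_PATTERN;
}

/* Hypothetical helper: reassemble the 16-bit immediate from imm12:imm4. */
static uint16_t a32_udf_imm16(uint32_t insn)
{
    uint32_t imm12 = (insn >> 8) & 0xfffu;
    uint32_t imm4 = insn & 0xfu;

    return (uint16_t)((imm12 << 4) | imm4);
}
```

For 0xe7f001f2, a32_is_udf() returns true and a32_udf_imm16() returns 0x12,
i.e. the word is "udf #18". Since the stall above reproduces without an
oops, the trap word was apparently never executed in this run.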