am335x: 5.18.x: system stalling

Arnd Bergmann arnd at arndb.de
Fri Aug 12 00:35:09 PDT 2022


On Tue, Jun 7, 2022 at 10:55 AM Yegor Yefremov
<yegorslists at googlemail.com> wrote:
> On Sun, Jun 5, 2022 at 4:59 PM Ard Biesheuvel <ardb at kernel.org> wrote:
> > On Fri, 3 Jun 2022 at 22:47, Arnd Bergmann <arnd at arndb.de> wrote:
> > > On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov <yegorslists at googlemail.com> wrote:
> > > >
> > > > With compiled-in drivers the system doesn't stall. All other tests and
> > > > related outputs will come next week.
> > >
> > > Ah, nice!
> > >
> > > It's probably a reasonable assumption that the smp-patched get_current()
> > > is (at least sometimes) broken in modules but working in the kernel itself.
> > > I suppose that means in the worst case we can hot-fix the issue by
> > > having an 'extern' version of get_current() for the case of
> > > armv6+smp+module ;-)
> > >
> >
> > I've coded something up along those lines, and pushed it to my
> > am335x-stall-test branch.
> >
> > > Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
> > > mail. If that gives you an oops for the module case, then we know
> > > that the patching doesn't work at all and you don't have to try anything
> > > else, otherwise it's more likely that an incorrect instruction sequence
> > > is patched in.
> > >
> >
> > Yeah, I'd be really surprised if the patching misses some occurrences,
> > so I have no clue what is going on here.
> >
> > Yegor, can you please try my branch with the original config (i.e.,
> > slcan and ftdi_sio as modules)?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
>
> @Arnd: I have applied your patch with this change:
>
> asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap
>
> But it revealed nothing new:
>
> [   50.754130] rcu: INFO: rcu_sched self-detected stall on CPU
>
> @Ard: I have tried your branch
> (21b6671c82d4df52ea0c7837705331acb375c5c8). The system still stalls.

Getting back to this old thread, as we never found out what is
actually going on.

It seems we are still stuck trying to figure out why a kernel with ARMv6
support and SMP patching is broken, and whether the same bug also affects
other configurations without ARMv6 support. This is of course very
unfortunate, but unless someone has an idea how to debug the problem
further, I suppose we should at least prevent that broken configuration:
disallow enabling CONFIG_SMP in combination with ARMv6 (pre-ARMv6K)
CPUs, to keep others from running into the same problem.
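
As an untested sketch only, such a restriction could be expressed as one
extra dependency on the SMP symbol in arch/arm/Kconfig. The existing
lines are abbreviated here, and the added "depends on !CPU_V6" line is
an assumption about how the check might be written, not a posted patch:

	config SMP
		bool "Symmetric Multi-Processing"
		depends on CPU_V6K || CPU_V7
		# hypothetical addition: refuse SMP when pre-ARMv6K
		# (CPU_V6) support is also enabled in the same kernel
		depends on !CPU_V6

With a dependency like this, a configuration that selects both CPU_V6
and CPU_V7 would simply lose the SMP option, so the runtime SMP
patching path would never be built for that combination.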

Any other suggestions?

        Arnd


