am335x: 5.18.x: system stalling

Yegor Yefremov yegorslists at googlemail.com
Fri Jun 3 12:11:42 PDT 2022


On Fri, Jun 3, 2022 at 11:32 AM Arnd Bergmann <arnd at arndb.de> wrote:
>
> On Fri, Jun 3, 2022 at 10:54 AM Yegor Yefremov
> <yegorslists at googlemail.com> wrote:
> > On Thu, Jun 2, 2022 at 2:27 PM Yegor Yefremov <yegorslists at googlemail.com> wrote:
> > > On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb at kernel.org> wrote:
>
> > > > That confirms my statement that smp.c cannot be the culprit, and
> > > > appears to exonerate the pure asm pieces. I wonder if this is related
> > > > to insufficient asm constraints on the C helpers, or just the cost
> > > > model taking different decisions because the inline asm string is much
> > > > longer. In any case, this opens up a couple of avenues we could
> > > > explore to narrow this down further.
> > > >
> > > > As a quick check, can you try the below snippet applied onto the
> > > > broken current.h build?
> > > >
> > > > --- a/arch/arm/include/asm/current.h
> > > > +++ b/arch/arm/include/asm/current.h
> > > > @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct
> > > > task_struct *get_current(void)
> > > >             "   b       . + (2b - 0b)                           \n\t"
> > > >             "   .popsection                                     \n\t"
> > > >  #endif
> > > > -           : "=r"(cur));
> > > > +           : "=r"(cur)
> > > > +           : "Q" (*(const unsigned long *)current_stack_pointer));
> > >
> > > Where is the current_stack_pointer defined?
> > >
> > > >  #elif __LINUX_ARM_ARCH__>= 7 || \
> > > >        !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
> > > >        (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))
> > > >
> > > > Given that the problematic sequence appears to be in C code, could you
> > > > please confirm whether or not the stall is reproducible when all the
> > > > pieces that are used by the CAN stack (musb, slcan, ftdio-sio, etc)
> > > > are built into the kernel rather than built as modules? Also, which
> > > > GCC version are you using?
> > >
> > > For now, the CAN stack parts are built as modules. I'll try to compile them in.
> > >
> > > I'm using GCC 10.x
> >
> > I have tried your patch (see the attachment) and the system stalls.
>
> This is with only get_current() patched on top of the working
> f0191ea5c2e5 ("[PART 1] ARM: implement THREAD_INFO_IN_TASK for
> uniprocessor systems"), right?
>
> My best theory right now is that something in get_currnent() is wrong that
> causes it to return the wrong task pointer, which in turn leads to
> current->preempt_count to get out of sync. This may be related to the cppi41
> dmaengine tasklet and effectively disables further softirqs including the timer
> that triggers the RCU grace period.
>
> When we finally switch tasks to the cpufreq worker thread, softirqs
> can happen again because of the task switch, and at the next IRQ
> the timer detects the stall.
>
> > Will try GCC 11.x and also compiled-in drivers.
>
> Ok. Maybe make sure all drivers are built-in here. I see both the CAN
> layer and the cppi41 driver use softirqs, so to be on the safe side,
> try to get to a running kernel that has no modules loaded at all at
> the time you expect the stall.
>
>
> One thing that could possibly go wrong with get_current() would be that
> it fails to get patched for some reason, or it gets patched only after
> it was already called. Since you run on an ARMv7 CPU as opposed to
> an actual OMAP2410/ARM1136r0, it would then try to load the
> variable from the uninitialized TPIDRURO register. If that happens,
> the one-liner below should tell you exactly where, by triggering an
> Oops. You can apply the patch on top for testing, it should have no
> other effects if the patching part works correctly.
>
>         Anrd
>
> 8<---
>
> diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
> index 2f9d79214b25..2a15832793c4 100644
> --- a/arch/arm/include/asm/current.h
> +++ b/arch/arm/include/asm/current.h
> @@ -33,7 +33,7 @@ static inline __attribute_const__ struct task_struct
> *get_current(void)
>          */
>         cur = __builtin_thread_pointer();
>  #elif defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) || defined(CONFIG_SMP)
> -       asm("0: mrc p15, 0, %0, c13, c0, 3                      \n\t"
> +       asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap
>  #ifdef CONFIG_CPU_V6
>             "1:                                                 \n\t"
>             "   .subsection 1                                   \n\t"

With compiled-in drivers the system doesn't stall. All other tests and
related outputs will come next week.

Have a nice weekend.

Yegor



More information about the linux-arm-kernel mailing list