am335x: 5.18.x: system stalling

Ard Biesheuvel ardb at kernel.org
Thu Jun 2 03:37:32 PDT 2022


On Thu, 2 Jun 2022 at 12:17, Yegor Yefremov <yegorslists at googlemail.com> wrote:
>
> On Wed, Jun 1, 2022 at 12:50 PM Ard Biesheuvel <ardb at kernel.org> wrote:
> >
> > On Wed, 1 Jun 2022 at 12:46, Yegor Yefremov <yegorslists at googlemail.com> wrote:
> > >
> > > On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb at kernel.org> wrote:
> > > >
> > > > On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists at googlemail.com> wrote:
> > > > >
> > > > > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb at kernel.org> wrote:
> > > > > >
> > > > > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb at kernel.org> wrote:
> > > > > > >
> > > > > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd at arndb.de> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > > > > <yegorslists at googlemail.com> wrote:
> > > > > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd at arndb.de> wrote:
> > > > > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > > > > patching or in the irqstacks.
> > > > > > > > > >
> > > > > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > > > > >
> > > > > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > > > > 2. f0191ea5c2e5 with the same config - not
> > > > > > > >
> > > > > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > > > > patch has seven changes, all of which can be done individually because
> > > > > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > > > > the exact same instructions as the version after the change, when running
> > > > > > > > on a uniprocessor machine such as your am335x.
> > > > > > > >
> > > > > > > > You have already shown earlier that the get_current() and
> > > > > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > > > > only those does not change the behavior.
> > > > > > > >
> > > > > > > > This leaves the is_smp() check in set_current(), and the
> > > > > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > > > > wrong with any of those five, but I would bet on the macros
> > > > > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > > > > the changes to set_current and get_current first, and then
> > > > > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > > > > problem?
> > > > > > > >
> > > > > > >
> > > > > > > set_current() is never called by the primary CPU, which is why the
> > > > > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > > > > pointless SMP check on secondary startup path").
> > > > > > >
> > > > > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > > > > see anything obviously wrong with those either.
> > > > > >
> > > > > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > > > > rid of the subsections, and uses normal branches (and code patching)
> > > > > > to switch between the thread ID register and the LDR to retrieve the
> > > > > > CPU offset and the current pointer. I have no explanation whether or
> > > > > > why it could make a difference, but I think it's worth a try.
> > > > >
> > > > > The link to your repo is missing.
> > > > >
> > > >
> > > > Oops, sorry :-)
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
> > >
> > > I have tested your branch and it stalls:
> > >
> >
> > OK, thanks for verifying.
>
> My bisection results for f0191ea5c2e5aab29484ede0493ca385eec5472f as a base:
>
> percpu.h: sporadic stalls
> current.h: always stalls
> assembler.h: no stalls
> smp.c: no stalls
>

So you mean that applying the changes to each of those files in
isolation to the baseline in f0191ea5c2e5aab29484ede0493ca385eec5472f
produces those results, right?

That confirms my statement that smp.c cannot be the culprit, and
appears to exonerate the pure asm pieces. I wonder if this is related
to insufficient asm constraints on the C helpers, or just the cost
model taking different decisions because the inline asm string is much
longer. In any case, this opens up a couple of avenues we could
explore to narrow this down further.

As a quick check, can you try the below snippet applied onto the
broken current.h build?

--- a/arch/arm/include/asm/current.h
+++ b/arch/arm/include/asm/current.h
@@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct
task_struct *get_current(void)
            "   b       . + (2b - 0b)                           \n\t"
            "   .popsection                                     \n\t"
 #endif
-           : "=r"(cur));
+           : "=r"(cur)
+           : "Q" (*(const unsigned long *)current_stack_pointer));
 #elif __LINUX_ARM_ARCH__>= 7 || \
       !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
       (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))

Given that the problematic sequence appears to be in C code, could you
please confirm whether or not the stall is reproducible when all the
pieces that are used by the CAN stack (musb, slcan, ftdio-sio, etc)
are built into the kernel rather than built as modules? Also, which
GCC version are you using?



More information about the linux-arm-kernel mailing list