[PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

Sun May 7 18:04:40 PDT 2023

On Sat May 6, 2023 at 2:35 AM AEST, Doug Anderson wrote:
> Hi,
>
> On Thu, May 4, 2023 at 7:36 PM Nicholas Piggin <npiggin at gmail.com> wrote:
> >
> > On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > > From: Colin Cross <ccross at android.com>
> > >
> > > Implement a hardlockup detector that doesn't doesn't need any extra
> > > arch-specific support code to detect lockups. Instead of using
> > > something arch-specific we will use the buddy system, where each CPU
> > > watches out for another one. Specifically, each CPU will use its
> > > softlockup hrtimer to check that the next CPU is processing hrtimer
> > > interrupts by verifying that a counter is increasing.
> >
> > Powerpc's watchdog has an SMP checker, did you see it?
>
> No, I wasn't aware of it. Interesting, it seems to basically enable
> both types of hardlockup detectors together. If that really catches
> more lockups, it seems like we could do the same thing for the buddy
> system.

It doesn't catch more lockups. On powerpc we don't have a reliable
periodic NMI hence the SMP checker. But it is preferable that a CPU
detects its own lockup because NMI IPIs can result in crashes if
they are taken in certain critical sections.

> If people want, I don't think it would be very hard to make
> the buddy system _not_ exclusive of the perf system. Instead of having
> the buddy system implement the "weak" functions I could just call the
> buddy functions in the right places directly and leave the "weak"
> functions for a more traditional hardlockup detector to implement.
> Opinions?
>
> Maybe after all this lands, the powerpc watchdog could move to use the
> common code? As evidenced by this patch series, there's not really a
> reason for the SMP detection to be platform specific.

The powerpc SMP checker could certainly move to common code if
others wanted to use it.

> > It's all to
> > all rather than buddy which makes it more complicated but arguably
> > bit better functionality.
>
> Can you come up with an example crash where the "all to all" would
> work better than the simple buddy system provided by this patch?

CPU2                     CPU3
spin_lock_irqsave(A)     spin_lock_irqsave(B)
spin_lock_irqsave(B)     spin_lock_irqsave(A)

CPU1 will detect the lockup on CPU2, but CPU3's lockup won't be
detected so we don't get the trace that can diagnose the bug.

Another thing I actually found it useful for is you can easily
see if a core (i.e., all threads in the core) or a chip has
died. Maybe more useful when doing presilicon and bring up work
or firmware hacking, but still useful.

Thanks,
Nick

> It
> seems like they would be equivalent, but I could be missing something.
> Specifically they both need at least one non-locked-up CPU to detect a
> problem. If one or more CPUs is locked up then we'll always detect it.
> I suppose maybe you could provide a better error message at lockup
> time saying that several CPUs were locked up and that could be
> helpful. For now, I'd keep the current buddy system the way it is and
> if you want to provide a patch improving things to be "all-to-all" in
> the future that would be interesting to review.