cpu hotplug : was: Re: [PATCH v3] hardlockup: detect hard lockups using secondary (buddy) CPUs

Doug Anderson dianders at chromium.org
Thu May 4 15:16:23 PDT 2023


Hi,

On Tue, May 2, 2023 at 8:23 AM Petr Mladek <pmladek at suse.com> wrote:
>
> On Mon 2023-05-01 08:24:46, Douglas Anderson wrote:
> > From: Colin Cross <ccross at android.com>
> >
> > Implement a hardlockup detector that doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > --- /dev/null
> > +++ b/kernel/watchdog_buddy_cpu.c
> > +int watchdog_nmi_enable(unsigned int cpu)
> > +{
> > +     /*
> > +      * The new CPU will be marked online before the first hrtimer interrupt
> > +      * runs on it.
>
> It does not need to be the first hrtimer interrupt. The CPU might have
> been offlined/onlined repeatedly. The counter might have any value.
>
> > +      * If another CPU tests for a hardlockup on the new CPU
> > +      * before it has run its first hrtimer, it will get a false positive.
> > +      * Touch the watchdog on the new CPU to delay the first check for at
> > +      * least 3 sampling periods to guarantee one hrtimer has run on the new
> > +      * CPU.
> > +      */

OK, I've updated the above comment to:

/*
 * The new CPU will be marked online before the hrtimer interrupt
 * gets a chance to run on it. If another CPU tests for a
 * hardlockup on the new CPU before it has run its hrtimer
 * interrupt, it will get a false positive. Touch the watchdog on
 * the new CPU to delay the check for at least 3 sampling periods
 * to guarantee one hrtimer has run on the new CPU.
 */
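
To make the false positive concrete, here is a minimal user-space sketch of
the check, not the kernel code itself. The array names mirror the per-CPU
variables discussed in this thread, but buddy_check() and the fixed NR_CPUS
are assumptions made purely for illustration. A CPU that comes back online
still has a stale snapshot equal to its counter, so an immediate check
reports a lockup unless the touch flag makes the checker skip one round.

```c
#include <stdbool.h>

#define NR_CPUS 4	/* illustrative size, not taken from the patch */

/* Stand-ins for the per-CPU variables; plain arrays in this model. */
static unsigned long hrtimer_interrupts[NR_CPUS];
static unsigned long hrtimer_interrupts_saved[NR_CPUS];
static bool watchdog_touch[NR_CPUS];

/* A CPU looks locked up if its counter has not moved since the last check. */
static bool is_hardlockup(unsigned int cpu)
{
	unsigned long now = hrtimer_interrupts[cpu];

	if (hrtimer_interrupts_saved[cpu] == now)
		return true;

	hrtimer_interrupts_saved[cpu] = now;
	return false;
}

/*
 * Hypothetical helper: one check of @cpu by its buddy. A pending touch
 * consumes the flag and skips this round, which gives the checked CPU
 * one more check interval to run its hrtimer interrupt.
 */
static bool buddy_check(unsigned int cpu)
{
	if (watchdog_touch[cpu]) {
		watchdog_touch[cpu] = false;
		return false;
	}

	return is_hardlockup(cpu);
}
```

In this model, calling buddy_check() on a CPU whose counter and snapshot are
both 10 reports a lockup right away, while setting watchdog_touch first
defers the comparison until the counter has had time to advance.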

> > +     per_cpu(watchdog_touch, cpu) = true;
>
> We should touch also the next_cpu:
>
>         /*
>          * We are going to check the next CPU. Our watchdog_hrtimer
>          * need not be zero if the CPU has already been online earlier.
>          * Touch the watchdog on the next CPU to avoid false positive
>          * if we try to check it in less than 3 interrupts.
>          */
>         next_cpu = watchdog_next_cpu(cpu);
>         if (next_cpu < nr_cpu_ids)
>                 per_cpu(watchdog_touch, next_cpu) = true;
>
> Alternative would be to clear watchdog_hrtimer. But it would kind-of
> affect also the softlockup detector.

Looks reasonable. I've incorporated it.
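
For reference, the resulting enable path can be modeled roughly as below.
This is a user-space sketch under stated assumptions rather than the patch
itself: the bool arrays stand in for the cpumask and the per-CPU flag,
NR_CPU_IDS stands in for nr_cpu_ids, and model_enable() is a hypothetical
name. The only point it illustrates is that both the CPU coming online and
the CPU it is about to start checking get touched.

```c
#include <stdbool.h>

#define NR_CPU_IDS 4	/* stand-in for nr_cpu_ids; illustrative only */

static bool watchdog_cpus[NR_CPU_IDS];	/* models the set of watched CPUs */
static bool watchdog_touch[NR_CPU_IDS];	/* models the per-CPU touch flag */

/*
 * Return the next watched CPU after @cpu, wrapping around. Returns
 * NR_CPU_IDS when no other CPU is being watched.
 */
static unsigned int watchdog_next_cpu(unsigned int cpu)
{
	for (unsigned int i = 1; i < NR_CPU_IDS; i++) {
		unsigned int next = (cpu + i) % NR_CPU_IDS;

		if (watchdog_cpus[next])
			return next;
	}

	return NR_CPU_IDS;
}

/* Hypothetical model of the enable path with both touches applied. */
static void model_enable(unsigned int cpu)
{
	unsigned int next_cpu;

	/* Whoever checks @cpu must skip one round. */
	watchdog_touch[cpu] = true;

	/* @cpu is about to check next_cpu, so that check skips a round too. */
	next_cpu = watchdog_next_cpu(cpu);
	if (next_cpu < NR_CPU_IDS)
		watchdog_touch[next_cpu] = true;

	/* Only now does @cpu become visible to the other checkers. */
	watchdog_cpus[cpu] = true;
}
```

With CPUs 0 and 2 already watched, bringing CPU 1 online in this model
touches CPUs 1 and 2 and leaves CPU 0 alone.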


