[PATCH v4 10/17] watchdog/hardlockup: Move perf hardlockup watchdog petting to watchdog.c

Doug Anderson dianders at chromium.org
Fri May 19 10:22:21 PDT 2023


Hi,

On Thu, May 11, 2023 at 8:46 AM Petr Mladek <pmladek at suse.com> wrote:
>
> > @@ -111,6 +125,11 @@ static void watchdog_hardlockup_interrupt_count(void)
> >
> >  void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> >  {
> > +     if (__this_cpu_read(watchdog_hardlockup_touch)) {
> > +             __this_cpu_write(watchdog_hardlockup_touch, false);
> > +             return;
> > +     }
>
> If we clear watchdog_hardlockup_touch() here then
> watchdog_hardlockup_check() won't be called yet another
> watchdog_hrtimer_sample_threshold perior.
>
> It means that any touch will cause ignoring one full period.
> The is_hardlockup() check will be done after full two periods.
>
> It is not ideal, see below.
>
> > +
> >       /*
> >        * Check for a hardlockup by making sure the CPU's timer
> >        * interrupt is incrementing. The timer interrupt should have
> > diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
> > index 9be90b2a2ea7..547917ebd5d3 100644
> > --- a/kernel/watchdog_perf.c
> > +++ b/kernel/watchdog_perf.c
> > @@ -112,11 +98,6 @@ static void watchdog_overflow_callback(struct perf_event *event,
> >       /* Ensure the watchdog never gets throttled */
> >       event->hw.interrupts = 0;
> >
> > -     if (__this_cpu_read(watchdog_nmi_touch) == true) {
> > -             __this_cpu_write(watchdog_nmi_touch, false);
> > -             return;
> > -     }
>
> The original code looks wrong. arch_touch_nmi_watchdog() caused
> skipping only one period of the perf event.
>
> I would expect that it caused restarting the period,
> something like:
>
>         if (__this_cpu_read(watchdog_nmi_touch) == true) {
>                 /*
>                  * Restart the period after which the interrupt
>                  * counter is checked.
>                  */
>                 __this_cpu_write(nmi_rearmed, 0);
>                 __this_cpu_write(last_timestamp, now);
>                 __this_cpu_write(watchdog_nmi_touch, false);
>                 return;
>         }
>
> By other words, we should restart the period in the very next perf
> event after the watchdog was touched.
>
> That said, the new code looks better than the original.
> IMHO, the original code was prone to false positives.

I had a little bit of a hard time following, but I _think_ the "tl;dr"
of all the above is that my change is fine. If I misunderstood, please
yell.


> Best Regards,
> Petr
>
> PS: It might be worth fixing this problem in a separate patch at the
>     beginning of this patchset. It might be a candidate for stable
>     backports.

Done. It's now its own patch and early in the series.



More information about the linux-arm-kernel mailing list