[PATCH v4 10/17] watchdog/hardlockup: Move perf hardlockup watchdog petting to watchdog.c
Doug Anderson
dianders at chromium.org
Fri May 19 10:22:21 PDT 2023
Hi,
On Thu, May 11, 2023 at 8:46 AM Petr Mladek <pmladek at suse.com> wrote:
>
> > @@ -111,6 +125,11 @@ static void watchdog_hardlockup_interrupt_count(void)
> >
> > void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > {
> > + if (__this_cpu_read(watchdog_hardlockup_touch)) {
> > + __this_cpu_write(watchdog_hardlockup_touch, false);
> > + return;
> > + }
>
> If we clear watchdog_hardlockup_touch() here then
> watchdog_hardlockup_check() won't be called yet another
> watchdog_hrtimer_sample_threshold perior.
>
> It means that any touch will cause ignoring one full period.
> The is_hardlockup() check will be done after full two periods.
>
> It is not ideal, see below.
>
> > +
> > /*
> > * Check for a hardlockup by making sure the CPU's timer
> > * interrupt is incrementing. The timer interrupt should have
> > diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
> > index 9be90b2a2ea7..547917ebd5d3 100644
> > --- a/kernel/watchdog_perf.c
> > +++ b/kernel/watchdog_perf.c
> > @@ -112,11 +98,6 @@ static void watchdog_overflow_callback(struct perf_event *event,
> > /* Ensure the watchdog never gets throttled */
> > event->hw.interrupts = 0;
> >
> > - if (__this_cpu_read(watchdog_nmi_touch) == true) {
> > - __this_cpu_write(watchdog_nmi_touch, false);
> > - return;
> > - }
>
> The original code looks wrong. arch_touch_nmi_watchdog() caused
> skipping only one period of the perf event.
>
> I would expect that it caused restarting the period,
> something like:
>
> if (__this_cpu_read(watchdog_nmi_touch) == true) {
> /*
> * Restart the period after which the interrupt
> * counter is checked.
> */
> __this_cpu_write(nmi_rearmed, 0);
> __this_cpu_write(last_timestamp, now);
> __this_cpu_write(watchdog_nmi_touch, false);
> return;
> }
>
> By other words, we should restart the period in the very next perf
> event after the watchdog was touched.
>
> That said, the new code looks better than the original.
> IMHO, the original code was prone to false positives.
I had a little bit of a hard time following, but I _think_ the "tl;dr"
of all the above is that my change is fine. If I misunderstood, please
yell.
> Best Regards,
> Petr
>
> PS: It might be worth fixing this problem in a separate patch at the
> beginning of this patchset. It might be a candidate for stable
> backports.
Done. It's now its own patch and early in the series.
More information about the linux-arm-kernel
mailing list