[PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus

Tue Jan 22 21:38:59 EST 2013

On Mon, Jan 14, 2013 at 4:30 PM, Colin Cross <ccross at android.com> wrote:
> On Mon, Jan 14, 2013 at 4:25 PM, Andrew Morton
> <akpm at linux-foundation.org> wrote:
>> On Mon, 14 Jan 2013 16:19:23 -0800
>> Colin Cross <ccross at android.com> wrote:
>>
>>> >> +static void watchdog_check_hardlockup_other_cpu(void)
>>> >> +{
>>> >> +     unsigned int next_cpu;
>>> >> +
>>> >> +     /*
>>> >> +      * Test for hardlockups every 3 samples.  The sample period is
>>> >> +      *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
>>> >> +      *  watchdog_thresh (over by 20%).
>>> >> +      */
>>> >> +     if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
>>> >> +             return;
>>> >
>>> > The hardwired interval Seems Wrong.  watchdog_thresh is tunable at runtime.
>>> >
>>> > The comment could do with some fleshing out.  *why* do we want to test
>>> > at an interval "slightly over watchdog_thresh"?  What's going on here?
>>>
>>> I'll reword it.  We don't want to be slightly over watchdog_thresh,
>>> ideally we would be exactly at watchdog_thresh.  However, since this
>>> relies on the hrtimer interrupts that are scheduled at watchdog_thresh
>>> * 2 / 5, there is no multiple of hrtimer_interrupts that will result
>>> in watchdog_thresh.  watchdog_thresh * 2 / 5 * 3 (watchdog_thresh *
>>> 1.2) is the closest I can get to testing for a hardlockup once every
>>> watchdog_thresh seconds.
>>
>> It needs more than rewording, doesn't it?  What happens if watchdog_thresh is
>> altered at runtime?
>
> I'm not sure what you mean.  If watchdog_thresh changes, the next
> hrtimer interrupt on each cpu will move the following hrtimer
> interrupt forward by the new watchdog_thresh * 2 / 5.  There may be a
> single cycle of watchdog checks at an intermediate period, but nothing
> bad should happen.
>
> This code doesn't ever deal with watchdog_thresh directly, it is only
> counting hrtimer interrupts.  3 hrtimer interrupts is always a
> reasonable approximation of watchdog_thresh.

I think I see the race you were referring to.  When the hrtimer
interrupt fires on the first cpu after changing the watchdog thresh,
it will see the new value and starting firing the timer at the new
rate.  The other cpus may not see the new value for as long as the old
sample period.  If the new threshold is more than 3 times lower than
the old value, one cpu could easily get 3 hrtimer interrupts while
another cpu doesn't get any, triggering the lockup detector.

The exact same issue is already present in the existing NMI detector,
which is conceptually very similar to this one.  It has two
asynchronous timers, the NMI perf counter and the hrtimer.  If one
timer sees the new threshold before the other their relative rates
will be wrong and the lockup detector may fire.  Setting
watchdog_nmi_touch is not good enough, as one interrupt could fire
enough to trigger the hardlockup detector twice.  The only fix I can
think of is to set a timestamp far enough in the future to catch all
the ordering problems, and ignore lockups until sched_clock() is
higher than the timestamp.  I think it would need to be a delay of
twice the larger of the old and new watchdog_thresh values.

But its worse than that, the NMI perf counter is never updated when
watchdog_thresh changes, so raising watchdog_thresh more than 2.5x may
trigger the NMI detector.

I'll send a separate patch for the first problem, but I don't have
anything to test resetting the NMI counter on so I'll leave that issue
for someone else.