[PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus

Mon Jan 14 20:53:40 EST 2013

On Mon, Jan 14, 2013 at 4:25 PM, Frederic Weisbecker <fweisbec at gmail.com> wrote:
> 2013/1/15 Colin Cross <ccross at android.com>:
>> On Mon, Jan 14, 2013 at 4:13 PM, Frederic Weisbecker <fweisbec at gmail.com> wrote:
>>> I believe this is pretty much what the RCU stall detector does
>>> already: checks for other CPUs being responsive. The only difference
>>> is in how it checks that. For RCU it's about checking for CPUs
>>> reporting quiescent states when requested to do so. In your case it's
>>> about ensuring the hrtimer interrupt is well handled.
>>>
>>> One thing you can do is to enqueue an RCU callback (call_rcu()) every
>>> minute so you can force other CPUs to report quiescent states
>>> periodically and thus check for lockups.
>>
>> That's a good point, I'll take a look at using that.  A minute is too
>> long; some SoCs have maximum HW watchdog periods of under 30 seconds,
>> but a call_rcu() every 10-20 seconds might be sufficient.
>
> Sure. And you can tune CONFIG_RCU_CPU_STALL_TIMEOUT accordingly.

After considering this, I think the hrtimer watchdog is more useful.
RCU stalls are not usually panic events, and I wouldn't want to add a
panic on every RCU stall.  The lack of stack traces on the affected
cpu makes a panic important.  I'm planning to add an ARM DBGPCSR panic
handler, which will be able to dump the PC of a stuck cpu even if it
is not responding to interrupts.  kexec or kgdb on panic might also
allow some inspection of the stack on the stuck cpu.

Failing to process interrupts is a much more serious event than an RCU
stall, and being able to detect the two separately may be very valuable
for debugging.
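
For readers following the thread, the secondary-cpu check can be
sketched roughly as below.  This is a user-space illustration only, not
the code from the patch: SKETCH_NR_CPUS and every sketch_* name are
hypothetical, and a real implementation would bump the counter from the
per-cpu hrtimer callback and panic instead of returning a flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* hypothetical cpu count for the sketch */

/* Timer interrupts handled so far by each cpu. */
static uint64_t sketch_ticks[SKETCH_NR_CPUS];
/* Tick count of each cpu as seen at the previous check. */
static uint64_t sketch_ticks_saved[SKETCH_NR_CPUS];

/* Stand-in for the per-cpu timer callback: record that it ran. */
static void sketch_timer_tick(unsigned int cpu)
{
	sketch_ticks[cpu]++;
}

/* Each cpu watches its neighbour, wrapping around at the end. */
static unsigned int sketch_next_cpu(unsigned int cpu)
{
	return (cpu + 1) % SKETCH_NR_CPUS;
}

/*
 * Run on 'cpu': returns true if the neighbouring cpu has handled no
 * timer interrupt since the previous check, i.e. it looks hard-locked.
 */
static bool sketch_check_next_cpu(unsigned int cpu)
{
	unsigned int next = sketch_next_cpu(cpu);
	uint64_t now = sketch_ticks[next];
	bool stuck = (now == sketch_ticks_saved[next]);

	sketch_ticks_saved[next] = now;
	return stuck;
}
```

The check needs no NMI because it runs on a different cpu from the one
being watched.  That also means it cannot produce a stack trace of the
stuck cpu by itself.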