[PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus
Colin Cross
ccross at android.com
Thu Jan 10 12:27:28 EST 2013
On Thu, Jan 10, 2013 at 6:02 AM, Don Zickus <dzickus at redhat.com> wrote:
> On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
>> Emulate NMIs on systems where they are not available by using timer
>> interrupts on other cpus. Each cpu will use its softlockup hrtimer
>> to check that the next cpu is processing hrtimer interrupts by
>> verifying that a counter is increasing.
>>
>> This patch is useful on systems where the hardlockup detector is not
>> available due to a lack of NMIs, for example most ARM SoCs.
>
> I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
> reserving an IRQ line as 'special' (can not be masked). Not sure if that
> is something worth looking at here (or even possible).
>
>> Without this patch any cpu stuck with interrupts disabled can
>> cause a hardware watchdog reset with no debugging information,
>> but with this patch the kernel can detect the lockup and panic,
>> which can result in useful debugging info.
>
> <SNIP>
>> +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
>> +static int is_hardlockup_other_cpu(int cpu)
>> +{
>> + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
>> +
>> + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
>> + return 1;
>> +
>> + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
>> + return 0;
>
> Will this race with the other cpu you are checking? For example if cpuA
> just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
> hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
> cpuA is stuck?
cpuA doesn't update its own hrtimer_interrupts_saved, cpuB does.
However, there may be a similar race during hotplug: if cpuB updates
hrtimer_interrupts_saved for cpuA and then goes offline, cpuC may
check cpuA before cpuA's next timer interrupt and see that
hrtimer_interrupts_saved == hrtimer_interrupts. I think this can be
solved by setting watchdog_nmi_touch for the next cpu when a cpu goes
online or offline.
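
To make the hotplug case concrete, here is a rough userspace model of it. It is only a sketch, not kernel code: plain arrays stand in for the per-cpu variables, and watcher_check() and handover_scenario() are hypothetical names invented for this illustration. It assumes cpuC checks cpuA before cpuA's hrtimer has fired again.

```c
#include <stdbool.h>

#define NCPUS 3

/* Userspace stand-ins for the per-cpu variables used by the patch. */
static unsigned long hrtimer_interrupts[NCPUS];
static unsigned long hrtimer_interrupts_saved[NCPUS];
static bool watchdog_nmi_touch[NCPUS];

/* Same comparison as is_hardlockup_other_cpu(); always run by the watcher. */
static int is_hardlockup_other_cpu(int cpu)
{
	unsigned long hrint = hrtimer_interrupts[cpu];

	if (hrtimer_interrupts_saved[cpu] == hrint)
		return 1;

	hrtimer_interrupts_saved[cpu] = hrint;
	return 0;
}

/* Hypothetical helper: one watcher pass over `cpu`, honouring the touch flag. */
static int watcher_check(int cpu)
{
	if (watchdog_nmi_touch[cpu]) {
		watchdog_nmi_touch[cpu] = false;
		return 0;
	}
	return is_hardlockup_other_cpu(cpu);
}

/*
 * Hypothetical scenario: cpuB samples cpuA and goes offline, then cpuC
 * takes over and checks cpuA before cpuA's counter has moved.
 * Returns what cpuC's first check reports (1 = lockup).
 */
static int handover_scenario(bool touch_on_hotplug)
{
	const int cpuA = 0;

	hrtimer_interrupts[cpuA] = 5;
	hrtimer_interrupts_saved[cpuA] = 0;
	watchdog_nmi_touch[cpuA] = false;

	watcher_check(cpuA);			/* cpuB: saved becomes 5 */

	/* cpuB goes offline here; the proposed fix touches the cpu it watched */
	if (touch_on_hotplug)
		watchdog_nmi_touch[cpuA] = true;

	return watcher_check(cpuA);		/* cpuC's first check of cpuA */
}
```

In this model handover_scenario(false) reports a lockup even though cpuA is healthy, while handover_scenario(true) skips the first check after the handover and leaves the comparison to the following one.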
>> +}
>> +
>> +static void watchdog_check_hardlockup_other_cpu(void)
>> +{
>> + int cpu;
>> + cpumask_t cpus = watchdog_cpus;
>> +
>> + /*
>> + * Test for hardlockups every 3 samples. The sample period is
>> + * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
>> + * watchdog_thresh (over by 20%).
>> + */
>> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
>> + return;
>> +
>> + /* check for a hardlockup on the next cpu */
>> + cpu = cpumask_next(smp_processor_id(), &cpus);
>> + if (cpu >= nr_cpu_ids)
>> + cpu = cpumask_first(&cpus);
>> + if (cpu == smp_processor_id())
>> + return;
>> +
>> + smp_rmb();
>> +
>> + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
>> + per_cpu(watchdog_nmi_touch, cpu) = false;
>> + return;
>> + }
>
> Same race here. Usually touch_nmi_watchdog is reserved for those
> functions that plan on disabling interrupts for a while. cpuB could set
> cpuA's watchdog_nmi_touch to false here expecting not to revisit this
> variable for another couple of seconds. While cpuA could read this
> variable milliseconds later after cpuB sets it and falsely assume there is
> a lockup?
>
> Perhaps I am misreading the code?
Again, cpuA won't ever read its own watchdog_nmi_touch variable, only
cpuB will. The only updates cpuA makes for itself are updating
hrtimer_interrupts and setting watchdog_nmi_touch to true. Updating
hrtimer_interrupts_saved and setting watchdog_nmi_touch to false are
done by the cpu watching over cpuA, so the only races here are when a
cpu goes offline and a different cpu starts watching over cpuA.
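
The ownership split can be sketched as a small userspace model. This is an illustration under stated assumptions, not the patch itself: struct cpu_state, own_timer_tick(), own_touch(), watcher_check() and first_report() are hypothetical names, time is counted in sample periods, and the watcher checks on every third period as in the patch.

```c
#include <stdbool.h>

/* Hypothetical per-cpu state; comments note which side writes each field. */
struct cpu_state {
	unsigned long hrtimer_interrupts;	/* written only by the cpu itself */
	unsigned long hrtimer_interrupts_saved;	/* written only by its watcher */
	bool watchdog_nmi_touch;		/* set by the cpu, cleared by its watcher */
};

/* Runs on the watched cpu, from its own hrtimer. */
static void own_timer_tick(struct cpu_state *self)
{
	self->hrtimer_interrupts++;
}

/* Runs on the watched cpu before it keeps interrupts off for a while. */
static void own_touch(struct cpu_state *self)
{
	self->watchdog_nmi_touch = true;
}

/* Runs on the watcher only; returns 1 if the target looks hard locked. */
static int watcher_check(struct cpu_state *target)
{
	unsigned long hrint = target->hrtimer_interrupts;

	if (target->watchdog_nmi_touch) {
		target->watchdog_nmi_touch = false;
		return 0;
	}
	if (target->hrtimer_interrupts_saved == hrint)
		return 1;

	target->hrtimer_interrupts_saved = hrint;
	return 0;
}

/*
 * The target ticks for the first `alive_ticks` periods and then stalls.
 * It touches the watchdog at period `touch_at` (-1 for never).
 * Returns the period at which the watcher first reports a lockup, or -1.
 */
static int first_report(int alive_ticks, int periods, int touch_at)
{
	struct cpu_state target = { 0, 0, false };
	int t;

	for (t = 1; t <= periods; t++) {
		if (t <= alive_ticks)
			own_timer_tick(&target);
		if (t == touch_at)
			own_touch(&target);
		if (t % 3 == 0 && watcher_check(&target))
			return t;
	}
	return -1;
}
```

With a single watcher there is no window in which the watched cpu and the watcher both write hrtimer_interrupts_saved. In this model a cpu that keeps ticking is never reported, one that stalls after period 4 is reported at period 9, and a touch at period 5 delays the report for a cpu stalled after period 3 from period 6 to period 9.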
> If not, I don't have a good idea on how to solve those races off the top of my
> head unfortunately.
>
> Cheers,
> Don