watchdogs and kdump
dzickus at redhat.com
Fri Oct 28 09:39:54 EDT 2011
On Thu, Oct 27, 2011 at 10:43:58PM +0100, Pádraig Brady wrote:
> On 10/27/2011 09:30 PM, Don Zickus wrote:
> > Hi,
> > I was assisting a customer the other day debugging a kdump problem, when we
> > noticed the real problem was the hardware watchdog was firing and
> > rebooting the box.
> > Of course, this can be inconvenient if the panic happens right before the
> > watchdog is supposed to be kicked, leading to a spontaneous reboot before
> > the second kernel finishes booting and loading the watchdog module.
> > I was trying to think of a way to solve this and thought, one way to
> > minimize the problem is to kick the watchdog before we jump into the kdump
> > kernel. Another way is to disable the watchdog entirely, but that doesn't
> > work on all hardware I believe.
> > Anyway, I was posting on the watchdog mailing list to see if anyone had any
> > ideas that might help. And if my above idea to kick the watchdog before
> > jumping into the kdump kernel seems ok, then an API would need to be
> > developed.
> > I am willing to do any coding and testing necessary, but before I did, I
> > wanted help to get a direction to go in first.
> > Thoughts?
> Seems like the appropriate thing to do is to call all the
> reboot notifiers that each watchdog registers.
> Since one is not doing a full SYS_RESTART (SYS_DOWN) though,
> i.e. not running through the BIOS code again,
> it might be worth having a different SYS_JUMP code in notifier.h
> that would allow you to kick rather than stop the watchdogs
> as the reboot notifiers generally do at the moment.
That is an interesting idea. Not sure if calling a blocking notifier in
the kdump path would be acceptable to the kexec folks. Then again using
the reboot notifier in the panic path may not be a good idea either; it
might lead to false expectations. :-/
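To make the SYS_JUMP suggestion quoted above concrete, here is a rough
userspace sketch, not kernel code. It assumes a hypothetical SYS_JUMP event
code (nothing like it exists in notifier.h today), and the fake_wdt driver,
chain_register() and chain_call() helpers are made up purely for
illustration. The chain is walked without any locking, so it says nothing
about whether a blocking notifier is acceptable in the kdump path.

```c
#include <assert.h>
#include <stddef.h>

/* SYS_DOWN mirrors the name used in the thread; SYS_JUMP is the
 * hypothetical new code. The numeric values are illustrative only. */
enum sys_event {
    SYS_DOWN = 0x0001,
    SYS_JUMP = 0x0100   /* assumption: not a real kernel constant */
};

/* Simplified stand-in for the kernel's struct notifier_block. */
struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb,
                         unsigned long event, void *data);
    struct notifier_block *next;
};

/* Hypothetical watchdog driver state; nb must stay the first member
 * so the callback can cast the notifier_block pointer back. */
struct fake_wdt {
    struct notifier_block nb;
    int running;   /* 1 while the watchdog is armed */
    int kicks;     /* number of times the timer was reset */
};

/* On a normal reboot the watchdog is stopped; on a jump into the
 * kdump kernel it is only kicked, so it keeps guarding the boot. */
int fake_wdt_notify(struct notifier_block *nb,
                    unsigned long event, void *data)
{
    struct fake_wdt *wdt = (struct fake_wdt *)nb;

    (void)data;
    switch (event) {
    case SYS_JUMP:
        wdt->kicks++;
        break;
    case SYS_DOWN:
        wdt->running = 0;
        break;
    default:
        break;
    }
    return 0;
}

/* Head of a minimal singly linked notifier chain. */
struct notifier_block *chain_head = NULL;

/* Add a notifier to the front of the chain. */
void chain_register(struct notifier_block *nb)
{
    assert(nb != NULL && nb->notifier_call != NULL);
    nb->next = chain_head;
    chain_head = nb;
}

/* Call every registered notifier with the given event. */
void chain_call(unsigned long event)
{
    struct notifier_block *nb;

    for (nb = chain_head; nb != NULL; nb = nb->next)
        nb->notifier_call(nb, event, NULL);
}
```

With a fake_wdt registered via chain_register(), chain_call(SYS_JUMP)
leaves it running and bumps its kick count, while chain_call(SYS_DOWN)
stops it, which is the distinction the new code is meant to carry.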
> I think it would be important not to stop the watchdog if possible,
> given the large amount of logic that's going to be executed
> after the jump.
I agree. Especially since kdump is still not 100% reliable.
Thanks for the feedback!