[PATCH 0/2] add new notifier function ,take3

Wed Apr 23 08:31:59 EDT 2008

Takenori Nagano <t-nagano at ah.jp.nec.com> writes:

> Hi,
>
> One of the reasons I want this functionality is to manage RAS
> tool behavior for postmortem actions, initially from kdb invocation.
> (I found kdb very useful for debugging and crash analysis in the lkcd
> days, but today it is a "want", not a "must" ;-))

Ok.  I have not heard any reason here why a breakpoint at panic
or inside of panic is not useful.

> The other postmortem action is disabling the hardware watchdog.
> The watchdog handler stops the keepalive heartbeat when the system
> panics, so we must disable the hardware watchdog as soon as possible,
> since 2nd kernel startup takes some time (10 or 100? secs) and there
> may be a window in which it misfires. But currently we have no chance
> to do anything before crash_exec().

The transition time from one kernel to the next should be under 1 sec,
although sha256 over the kdump kernel and its ramdisk may slow things
down a little more for lots of data.  After that you are talking about
the time for the drivers to initialize.

If the concern is petting a watchdog to keep the system from
rebooting, getting the kernel to initialize the watchdogs quickly
appears to be the correct answer.
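
The timing argument above can be sketched as a small userspace model.
This is only an illustration under stated assumptions: the struct, the
function names, and every number are hypothetical and come from neither
the patch nor the kernel.

```c
#include <stdbool.h>

/* Hypothetical timing model; every field is an illustrative assumption. */
struct wd_timing {
    unsigned timeout_ms;        /* hardware watchdog timeout */
    unsigned since_last_pet_ms; /* age of the last keepalive when the panic hits */
    unsigned transition_ms;     /* jump from the crashed kernel into the kdump kernel */
    unsigned driver_init_ms;    /* until the kdump kernel's watchdog driver pets again */
};

/* Total time the watchdog goes without a keepalive across the crash. */
static unsigned wd_unpetted_ms(const struct wd_timing *t)
{
    return t->since_last_pet_ms + t->transition_ms + t->driver_init_ms;
}

/* True if the watchdog would reset the machine before the second kernel pets it. */
static bool wd_fires_before_rearm(const struct wd_timing *t)
{
    return wd_unpetted_ms(t) >= t->timeout_ms;
}
```

With an assumed 60 s timeout, 5 s since the last pet, and a 1 s
transition, the model says the watchdog survives only if driver_init_ms
stays below 54 s, which is why the driver initialization time is the
term worth shrinking.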

> And think about clustering software. If the system encounters
> a panic, it must notify the standby node. But... :-(

If the concern is to notify another system of the crash quickly,
I see no reason why we can not do this very early in the second
kernel or perhaps even in the purgatory code in kexec.  If the
code is too hairy to do there, then the code is likely to be
too hairy to do reliably when the system panics.

> I am interested in pre-dump scripts Neil mentioned. I think it can
> resolve some of our requirements. I will try it.

> For quick invocation of kdump, I partially agree with the idea that
> "kdump should be invoked as soon as the system panics, since we can not
> trust broken kernels", but we would like to have some choice of what
> to do on panic (and if the notifier is controllable by my patch,
> you can still call kdump first)
>
> Anyway, a completely broken kernel can not call kdump or any other
> mechanism  ;-P  and I feel it is somewhat a matter of degree.

Yes.  Completely broken kernels may not recognize they have a problem.

It is the design goal of the kexec on panic path to work with as much
of the kernel broken as possible.

Additionally, reviewing and testing that code is extremely difficult,
because it is the one piece of code in the kernel where debugging
tools are not available.  Putting random tunable code on that path
hugely reduces its maintainability.

Further, for all of the cases I have seen there is only one correct
action to take; things do not need to be tunable.  So a generally
tunable interface appears to be a design mistake.

Eric