[PATCH], issue EOI to APIC prior to calling crash_kexec in die_nmi path

Wed Feb 6 20:30:21 EST 2008

Ingo Molnar <mingo at elte.hu> writes:

> * Eric W. Biederman <ebiederm at xmission.com> wrote:
>
>> Looking at the patch the local_irq_enable() is totally bogus.  As soon 
>> as we hit machine_crash_shutdown the first thing we do is disable 
>> irqs.
>
> yeah.
>
>> I'm wondering if someone was using the switch cpus on crash patch that 
>> was floating around.  That would require the ipis to work.
>> 
>> I don't know if nmi_exit makes sense.  There are enough layers of 
>> abstraction in that piece of code I can't quickly spot the part that 
>> is banging the hardware.
>> 
>> The location of nmi_exit in the patch is clearly wrong.  crash_kexec 
>> is a noop if we don't have a crash kernel loaded (and if we are not 
>> the first cpu into it), so if we don't execute the crash code 
>> something weird may happen.  Further the code is just more 
>> maintainable if that kind of code lives in machine_crash_shutdown.
>
> nmi_exit() has no hw effects - it's just our own bookkeeping.

Alright so that should have no hardware effect then.

> the hw knows that we finished the NMI when we do an iret. Perhaps that's 
> the bug or side-effect that made the difference: via enabling irqs we 
> get an irq entry, and that does an iret and clears the NMI nested state 
> - allowing the kexec context to proceed? I suspect kexec() will do an 
> iret eventually (at minimum in the booted up kernel's context) - all 
> NMIs are blocked up to that point and maybe the APIC doesn't really like 
> being frobbed in that state? In any case, the local_irq_enable() is just 
> wrong - it's the worst thing a crashing kernel can do. Perhaps doing an 
> intentional iret with a prepared stack-let that just restores to 
> still-irqs-off state and jumps to the next instruction could 'exit' the 
> NMI context without really having to exit it in the kernel code flow?

Well I think this is slightly on the wrong track.  The original patch description
said we get as far as purgatory.  Purgatory is the bit of C code from
/sbin/kexec that runs just before our second kernel.  It sets up
arguments and verifies that the target kernel has a valid sha256sum.  If
purgatory detects data corruption it spins, to prevent a corrupt
recovery kernel from doing something nasty.
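
A minimal user-space sketch of that check-then-spin behaviour follows.  It is
an illustration only, not the kexec-tools source: the digest is a small
FNV-1a stand-in rather than SHA-256, and every name in it (toy_digest,
image_is_intact, purgatory_model, fake_kernel_entry) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in digest (32-bit FNV-1a). Real purgatory checks a SHA-256 sum;
 * FNV-1a is swapped in here only to keep the sketch short. */
static uint32_t toy_digest(const uint8_t *buf, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 if the image still matches the digest recorded at load time. */
static int image_is_intact(const uint8_t *image, size_t len, uint32_t expected)
{
    return toy_digest(image, len) == expected;
}

/* Hypothetical stub standing in for the jump into the second kernel. */
static int kernel_entered;
static void fake_kernel_entry(void)
{
    kernel_entered = 1;
}

/* Model of the decision: hang on corruption, otherwise hand over control. */
static void purgatory_model(const uint8_t *image, size_t len, uint32_t expected,
                            void (*enter_kernel)(void))
{
    if (!image_is_intact(image, len, expected)) {
        for (;;)
            ;   /* deliberate hang: never start a corrupt recovery kernel */
    }
    enter_kernel();
}
```

In this model a hang "in purgatory" means the digest check failed, which is
why reaching purgatory at all says the earlier steps ran.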

Since we get as far as purgatory, it appears that the primary crash_kexec
path is working fine.

The original description speculated that we had non-stopped cpus that
were telling the hardware to shut off.

I don't see what the hang is.  However the goal apparently is to make
the kexec on panic path more robust so that we can take crash dumps in
more strange cases.

We can get NMIs from the nmi watchdogs, so it is possible this happens
on legitimate hardware.  That means there is a chance this is
deterministic and that we can get enough information to debug and fix
the original error.

If part of the problem is getting to crash_kexec my inclination is to
move the call to crash_kexec up as early as possible in die_nmi, as
we may simply be hanging in printk or something stupid like that.
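
A rough user-space model of that ordering, to make the idea concrete.  This
is not the kernel's die_nmi; the *_model functions, the event log, and the
two flags are hypothetical.  The only behaviour taken from the thread is that
crash_kexec does nothing when no crash kernel is loaded or when another cpu
got there first.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 8
static const char *events[MAX_EVENTS];   /* order in which steps ran */
static size_t n_events;

static void record(const char *what)
{
    if (n_events < MAX_EVENTS)
        events[n_events++] = what;
}

static bool crash_image_loaded;   /* models "a crash kernel is loaded" */
static bool crash_in_progress;    /* models "another cpu got here first" */

/* Model of crash_kexec(): true if it would have jumped to the crash kernel,
 * false if it was a noop. */
static bool crash_kexec_model(void)
{
    if (!crash_image_loaded || crash_in_progress)
        return false;
    crash_in_progress = true;
    record("crash_kexec");
    return true;
}

/* Model of die_nmi with the crash_kexec call moved to the very top, so a
 * hang in the diagnostics cannot stop us from reaching the crash kernel. */
static void die_nmi_model(void)
{
    if (crash_kexec_model())
        return;             /* the real call would not come back */
    record("printk");       /* diagnostics only run when kexec was a noop */
}
```

With a crash kernel loaded the model never reaches the printk step; without
one it falls through to the usual diagnostics.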

It is weird that only the 32-bit die_nmi path calls bust_spinlocks.

I'm not really happy with the secondary cpus taking the whole notify_die
path as that is more general purpose infrastructure that might go
bad.  However it doesn't appear broken, and it should not be critical
to the crash dump process.

Eric