Intermittent problem launching the backup kernel

Patrick Lengel plengel at sourcefire.com
Tue Oct 8 13:29:15 EDT 2013


Hello Kexec community,

I am writing to see if it is possible to receive some
assistance to figure out why the kexec program is not
working properly on one of my system types.  While I have
successfully added the kexec technology to many of our
systems running various processors and kernel versions
there is one system in particular that has intermittent
problems.  The errant system type fails less than 5% of the
time: it freezes when attempting to start the backup crash
kernel stored in memory.  I believe that the
basic framework is correct but something is going amiss
when trying to start the backup crash kernel.

My system is running:

kexec-tools version 2.0.4
Linux kernel 2.6.35.14
Xeon X3450 processor
16G of memory

To gather more information I placed some debug printk
statements into the kexec kernel code to determine the
code path following a memory exception OOPS event.
I verified the following code path:

no_context()      - arch/x86/mm/fault.c
oops_end()        - arch/x86/kernel/dumpstack.c
crash_kexec()     - kernel/kexec.c
machine_kexec()   - arch/x86/kernel/machine_kexec_64.c
load_segments()   - Assembly module

I have been able to get the kexec and kdump functionality
to work properly on a nearly identical platform using the
exact same 2.6.35.14 kernel file but running on an Intel
Xeon X3430 processor.

My Linux startup command line is:
auto BOOT_IMAGE=3D-5.3.0 ro root=805 console=tty0 \
console=ttyS0,9600 memmap=1G$4G \
crashkernel=128M@32M oops=panic

And my kexec command is:
/usr/sbin/kexec -p /boot/bzImage-2.6.35.14-sf.westmere-37 \
--append=" root=/dev/sda5 1 irqpoll maxcpus=1 \
reset_devices console=tty0 console=ttyS0,9600 \
memmap=1G$4G  oops=panic"

We have a custom network interface card that uses the
memory segment “memmap=1G$4G” and can be
performing a DMA into main memory over the PCI bus
when the intentional memory exception event occurs.
I tried to place the backup kernel far away from this at an
offset of just 32M in the kernel OS space.  The other
systems that do not fail have these cards as well so this
card would not be the sole reason for the failure.
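
For reference, the address arithmetic behind that placement can be
sketched as below.  This is only an illustration, assuming the usual
kernel parameter semantics (crashkernel=size@offset reserves "size"
bytes starting at "offset", and memmap=size$start marks "size" bytes
starting at "start" as reserved).  The variable names and the
ranges_overlap helper are hypothetical and not part of kexec-tools.

```shell
#!/bin/sh
# Sketch: compare the crashkernel= reservation with the memmap= hole.
# Requires 64-bit shell arithmetic (the addresses exceed 4G).

M=$((1024 * 1024))
G=$((1024 * M))

# crashkernel=128M@32M : 128M reserved at physical offset 32M
crash_start=$((32 * M))
crash_end=$((crash_start + 128 * M))   # exclusive end

# memmap=1G$4G : 1G reserved starting at physical address 4G
dma_start=$((4 * G))
dma_end=$((dma_start + 1 * G))         # exclusive end

# Two half-open ranges [a_start, a_end) and [b_start, b_end) overlap
# if and only if each one starts before the other one ends.
# Arguments: a_start a_end b_start b_end
ranges_overlap() {
    [ "$1" -lt "$4" ] && [ "$3" -lt "$2" ]
}

printf 'crash kernel: 0x%08x-0x%08x\n' "$crash_start" "$((crash_end - 1))"
printf 'memmap hole:  0x%08x-0x%08x\n' "$dma_start" "$((dma_end - 1))"

if ranges_overlap "$crash_start" "$crash_end" "$dma_start" "$dma_end"; then
    echo "ranges overlap"
else
    echo "no overlap, gap $(( (dma_start - crash_end) / M )) MiB"
fi
```

Under these assumptions the crash kernel occupies 0x02000000-0x09ffffff
and the reserved DMA window occupies 0x100000000-0x13fffffff, so the two
regions are separated by 3936 MiB.  The first range is what a
"Crash kernel" entry in /proc/iomem would be expected to show if the
reservation was honoured as requested.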

If there is anyone that can provide any suggestions or
advice on how to proceed that would be very much
appreciated.

Kind regards,
Patrick Lengel


