[PATCH net-next v2 0/2] kernel: add support to collect hardware logs in crash recovery kernel

Rahul Lakkireddy rahul.lakkireddy at chelsio.com
Mon Mar 26 06:45:40 PDT 2018


On Saturday, March 24, 2018 at 20:50:52 +0530, Eric W. Biederman wrote:
> 
> Rahul Lakkireddy <rahul.lakkireddy at chelsio.com> writes:
> 
> > On production servers running a variety of workloads over time, kernel
> > panics can happen sporadically after days or even months. It is
> > important to collect as many debug logs as possible to root-cause
> > and fix a problem that may not be easy to reproduce. A snapshot of
> > the underlying hardware/firmware state (register dumps, firmware
> > logs, adapter memory, etc.) at the time of the kernel panic is very
> > helpful when debugging the culprit device driver.
> >
> > This patch series adds a new generic framework that enables device
> > drivers to collect a device-specific snapshot of the hardware/firmware
> > state of the underlying device in the crash recovery kernel. In the
> > crash recovery kernel, the collected logs are exposed via the
> > /sys/kernel/crashdd/ directory, which user space scripts copy for
> > post-analysis.
> >
> > A new kernel module, crashdd, is added. In the crash recovery kernel,
> > crashdd exposes the /sys/kernel/crashdd/ directory containing
> > device-specific hardware/firmware logs.
> 
> Instead of adding a sysfs file, have you looked at adding the dumps
> as additional ELF notes in /proc/vmcore?
> 

I see that the crash recovery kernel's memory is not present in any of
the PT_LOAD headers.  So, makedumpfile does not collect dumps that
reside in the crash recovery kernel's memory.
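
For illustration, here is a minimal user-space sketch of how one could
check whether a physical address is covered by any PT_LOAD header of an
ELF core file such as /proc/vmcore. It assumes a little-endian ELF64
image; the names Phdr, read_program_headers and covered_by_load are
hypothetical helpers for this sketch and are not part of this series or
of makedumpfile.

```python
import struct
from collections import namedtuple

PT_LOAD = 1  # loadable segment (ELF specification)
PT_NOTE = 4  # auxiliary note segment (ELF specification)

# Field order of Elf64_Phdr; the tuple name is a hypothetical helper.
Phdr = namedtuple(
    "Phdr",
    "p_type p_flags p_offset p_vaddr p_paddr p_filesz p_memsz p_align",
)


def read_program_headers(f):
    """Read all program headers from a seekable binary file object.

    Assumption of this sketch: the image is little-endian ELF64.
    """
    f.seek(0)
    ehdr = f.read(64)  # sizeof(Elf64_Ehdr)
    if len(ehdr) < 64 or ehdr[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if ehdr[4] != 2 or ehdr[5] != 1:  # EI_CLASS == ELFCLASS64, EI_DATA == LSB
        raise ValueError("only little-endian ELF64 is handled here")
    (e_phoff,) = struct.unpack_from("<Q", ehdr, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", ehdr, 0x36)
    headers = []
    for i in range(e_phnum):
        f.seek(e_phoff + i * e_phentsize)
        headers.append(Phdr(*struct.unpack("<IIQQQQQQ", f.read(56))))
    return headers


def covered_by_load(headers, paddr):
    """Return True if paddr falls inside the range of some PT_LOAD header."""
    return any(
        h.p_type == PT_LOAD and h.p_paddr <= paddr < h.p_paddr + h.p_memsz
        for h in headers
    )
```

With such a helper, opening /proc/vmcore in binary mode in the crash
recovery kernel and passing the physical address of a driver's dump
buffer to covered_by_load() would show whether a tool that only walks
PT_LOAD segments could reach that buffer.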

Also, are you suggesting exporting the dumps themselves as PT_NOTE
instead?  I'll look into doing it this way.
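
As a rough sketch of what "dumps as ELF notes" means on the reader
side: a PT_NOTE segment is a sequence of records, each with a 12-byte
header (namesz, descsz, type) followed by the name and the descriptor,
both padded to 4 bytes. The code below walks such a buffer. The note
name "CRASHDD" and type value 1 in the usage example are hypothetical
placeholders, not values defined by this series, and little-endian
layout is assumed.

```python
import struct


def pack_note(name, ntype, desc):
    """Build one ELF note record (little-endian), e.g. for testing."""
    raw_name = name.encode("ascii") + b"\0"  # namesz counts the NUL

    def pad4(b):
        # Pad to the next 4-byte boundary, as the note format requires.
        return b + b"\0" * (-len(b) % 4)

    header = struct.pack("<III", len(raw_name), len(desc), ntype)
    return header + pad4(raw_name) + pad4(desc)


def iter_notes(buf):
    """Yield (name, type, desc) for each note record in buf."""
    off = 0
    while off + 12 <= len(buf):
        namesz, descsz, ntype = struct.unpack_from("<III", buf, off)
        if namesz == 0:
            # Assumption of this sketch: an empty name marks trailing
            # zero padding, so stop here.
            break
        off += 12
        name = buf[off:off + namesz].rstrip(b"\0")
        off += (namesz + 3) & ~3
        desc = buf[off:off + descsz]
        off += (descsz + 3) & ~3
        yield name.decode("ascii", "replace"), ntype, desc


# Usage: a hypothetical device dump stored as the descriptor of one note.
segment = pack_note("CRASHDD", 1, b"abcde") + pack_note("CORE", 3, b"")
for name, ntype, desc in iter_notes(segment):
    print(name, ntype, len(desc))
```

Because the descriptor is an opaque byte string, a tool that copies the
note segment verbatim would carry the device dump along without needing
to understand its contents.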

> That should allow existing tools to capture your extended dump
> information with no code changes, and it will allow having a single file
> core dump for storing the information.
> 
> Both of which should make this integrate better into existing
> flows.
> 
> The interface logic of the driver should be essentially the same.
> 
> 
> Also have you tested this and seen how well your current logic captures
> the device information?
> 

Yes, the hardware snapshot is pretty close to the state at the time of
the kernel panic.  It is better than risking not being able to collect
anything at all during the kernel panic.

Thanks,
Rahul


