[Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump
tony.luck at intel.com
Wed Oct 5 11:58:53 EDT 2011
> > The plan is to pass-down the list of poisoned memory pages to the second
> > kernel using an elf-note so that these pages are left untouched during
> > dump capture. I'm working on an implementation of the same and should
> > have patches soon.
> I would say let us first figure out what happens while reading a poisoned
> page and is this a problem before working on a solution.
If the page is poisoned because of a real uncorrectable error in memory
(reported as an SRAO machine check today, or as SRAR real-soon-now), then
accessing the page from the processor while taking a memory dump will
result in a machine check.
Note that a large memory system that has been running for a long time
may have built up a small stash of these land-mine pages - and we need
to worry about them even when the panic is not machine check related.
In fact, especially then: that is the case where we actually do want
the dump to diagnose the cause of the panic, and we don't want to risk
losing the crash dump because we aborted when touching a page that the
OS had safely avoided for days/weeks/months.
So passing a list of poisoned pages from the old kernel to the new kernel
is a good idea - and is independent of the cause of the crash (except that
in the fatal machine check case due to memory error the list is guaranteed
to be non-empty).
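
To make the idea concrete, here is a rough sketch of how the capture side
might consult such a list while walking memory. Everything in it is an
assumption for illustration: struct poison_list, pfn_is_poisoned() and
count_dumpable_pages() are hypothetical names, and the layout (a sorted
array of page frame numbers) is just one possible decoding of whatever the
elf-note ends up carrying, not the format from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical: poisoned page frame numbers as the second kernel (or
 * the dump tool) might hold them after decoding the elf-note payload.
 * Assumed to be sorted ascending with no duplicates.
 */
struct poison_list {
	const uint64_t *pfns;
	size_t count;
};

/* Binary search the sorted list; true if pfn must not be touched. */
static bool pfn_is_poisoned(const struct poison_list *pl, uint64_t pfn)
{
	size_t lo = 0, hi = pl->count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pl->pfns[mid] == pfn)
			return true;
		if (pl->pfns[mid] < pfn)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}

/*
 * Walk [start, end) and count the pages that are safe to read.
 * A real dump tool would copy each safe page out at this point
 * instead of counting; poisoned pages are simply skipped.
 */
static size_t count_dumpable_pages(const struct poison_list *pl,
				   uint64_t start, uint64_t end)
{
	size_t n = 0;

	assert(start <= end);
	for (uint64_t pfn = start; pfn < end; pfn++)
		if (!pfn_is_poisoned(pl, pfn))
			n++;
	return n;
}
```

The check depends only on the list, not on why we crashed, which is the
point made above: an empty list costs nothing, and a non-empty one keeps
the dump from stepping on a land-mine page.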
Passing some crash signature data - so the new kernel/dump-tools can make
a choice whether to even try to take a full dump is also interesting (but
independent from the bad page list).