[Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump
bp at alien8.de
Wed Oct 12 06:44:29 EDT 2011
On Wed, Oct 12, 2011 at 12:14:34AM +0530, K.Prasad wrote:
> The MC4_CTL_MASK doesn't appear to be defined in the kernel. Looking at
> http://support.amd.com/us/Processor_TechDocs/26094.PDF, Page 196, it
> states that "This register is typically programmed by BIOS and not by
> the Kernel software".
Oh, this is K8 BKDG, thus pretty old. For AMD docs, you could use
developer.amd.com, and more specifically
So if we look at the F10h manual:
there's this section "126.96.36.199.1 Machine Check Error Logging and
Reporting" on p. 167 which explains all the modalities around switching
And if you clear CR4.MCE, the machine would shutdown on a fatal MCE as
an additional precation when running software which doesn't support
MCE (fully) but you still don't want to corrupt your data: "If error
reporting is enabled but CR4.MCE is disabled, a reportable error will
cause the system to enter shutdown."
Thus clearing the MCi_CTL_MASK bit should help you.
> So, in any case we may not be able to disable machine-check exceptions
> (MCEs) only within the context of kexec'ed kernel. Let me know if I've
> missed something here.
I'm not sure it is advisable to completely disable MCA for the whole
duration of the image dumping, especially on a system which has already
booted into the second kernel due to an MCE.
> > But, regardless, according to Vivek, the "makedumpfile" tool should be
> > able to jump over poisoned pages and you don't need all the hoopla above
> > at all, right?
> In short, the answer is yes. We could add a new string, say
> "CRASH_REASON=PANIC_MCE" to VMCOREINFO elf-note which can be parsed by
> 'makedumpfile' and get away without adding the new NT_NOCOREDUMP
> elf-note. Parsing through the log_buf to lookout for panic string from
> inside 'makedumpfile' appears to be a clumsy solution though.
Why, 'makedumpfile' reportedly supports some dmesg parsing already -
why would you need additional functionality when it can be done with
in-house means already. Maybe Vivek should comment on whether this makes
sense but I'm basically reiterating what he said.
> i) Scenario1: System crashes because of a fatal MCE
> Proposed Solution: Add a new string in the VMCOREINFO elf-note from
> within the MCE panic path to indicate cause of crash. 'makedumpfile'
> recognises this string to collect a slimdump instead of the normal dump.
> ii) Scenario2: System with PG_hwpoison (or landmine!) pages crashes because
> of a software bug. In this case, kexec kernel would normally reboot because
> of reading the PG_poison page. I'll soon get a new version of the patchset
> implementing this.
> Solution: Maintain a linked list of PFNs when the corresponding 'struct page'
> has been marked PG_hwpoison. We could export/put this list to use in
> quite a few ways.
Let me stop you right there: again, according to Vivek:
makedumpfile can iterate over the struct page arrays and skip over
PG_hwpoison pages. I think this should be enough of functionality....
More information about the kexec