[RFC] Kdump and memory error handling

Mon May 9 13:40:13 EDT 2011

On Mon, May 09, 2011 at 10:59:35PM +0530, K.Prasad wrote:
> On Wed, May 04, 2011 at 10:39:14PM +0200, Andi Kleen wrote:
> > > Any thoughts/suggestions?
> > 
> > My old attempts to solve this are
> > 
> > Don't dump on MCE:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=shortlog;h=refs/heads/mce/xpanic
> > 
> 
> The problem we see in avoiding a panic->crash_kexec->[coredump capture] is
> that the user may not have a means to know the reason for crash, unless
> the serial console is connected to capture and store the panic string.
> 
> Alternatively a 'slim' kdump (as described here:
> https://lkml.org/lkml/2011/5/4/396) would not contain meaningless data from
> the old memory, but inform the user about the cause of the crash. I'm
> intending to post some patches with a quick implementation of it soon.
> 
> > Handle dumps of corrupted memory regions:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=shortlog;h=refs/heads/mce/crashdump
> > 
> 
> > IMHO these patches are still the right solutions for this.
> > 
> 
> As Vatsa had raised, the processor's behaviour upon reading (or performing
> any I/O operation on) the faulty memory location isn't clearly defined (to
> the extent I read through System Programming Guide Part 1, Volume 3A,
> Chapter 15). In such a scenario, disabling MCE for the kdump kernel (which can
> potentially read the faulty memory) is making things hazy.

How would a slim dump make that any better? And why is leaving it to user
space to filter out the relevant pieces not a good idea?

I agree that it can lead to failure if the memory we depend on for
extracting the right information is corrupted, but then a slim dump
should have similar issues too (unless we do something smart, such as
determining a safe region and putting all the information regarding the
dump there from inside the kernel after the fault).
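
To make the user-space filtering idea concrete, here is a rough sketch only.
It assumes the capture environment is somehow handed a list of bad page
frame numbers; how that list would be obtained is exactly the open question
in this thread. PAGE_SIZE, filter_dump and poisoned_pfns are hypothetical
names for illustration, not part of any existing tool.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def filter_dump(old_memory, poisoned_pfns):
    """Copy every page of old_memory except those listed as bad.

    old_memory    -- bytes object standing in for the crashed kernel's memory
    poisoned_pfns -- set of page frame numbers reported as faulty (hypothetical)
    Returns a dict mapping pfn -> page contents for the pages that were saved.
    """
    saved = {}
    for pfn in range(len(old_memory) // PAGE_SIZE):
        if pfn in poisoned_pfns:
            # Skip the page without ever reading its contents.
            continue
        start = pfn * PAGE_SIZE
        saved[pfn] = old_memory[start:start + PAGE_SIZE]
    return saved


# Hypothetical 4-page "old memory" in which page 2 was reported bad.
old_memory = bytes(4 * PAGE_SIZE)
dump = filter_dump(old_memory, {2})
```

Note that this only helps if the list of bad pages itself lives in memory
that is still trustworthy, which is the same caveat as above.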

Thanks
Vivek