[Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump

Wed Oct 5 03:31:11 EDT 2011

On Wed, Oct 05, 2011 at 12:37:28PM +0530, K.Prasad wrote:
> > Well, there are MCE types for which we need to panic but we don't
> > necessarily corrupt memory. Your approach is to unconditionally avoid
> > dumping core whenever we panic while you should look at the MCE
> > signature and decide then whether to capture crashed kernel memory or
> > not.
> > 
> > For example, if the MCE signature says UC DRAM error, then you can
> > be pretty sure that there is a landmine somewhere in the DRAM region
> > mapping the crashed kernel. If it is, say, a UC when doing data fills
> > from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
> > even in the first case, you can evaluate the MCi_ADDR reported with the
> > UC DRAM error and simply skip that particular cacheline when dumping the
> > core instead of not capturing anything at all.
> > 
> 
> True. As I stated earlier, capturing a memory dump in such cases has
> two possible outcomes: it is either dangerous or it doesn't make
> sense.

Why? In the second example the only corruption is in the L2 cache, so
your memory image is intact. Why wouldn't you want to capture a memory
dump then? It is business as usual in that case.
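
For the UC DRAM case quoted above, skipping only the affected cacheline
could look roughly like the sketch below. It is a userspace illustration,
not kernel or makedumpfile code: range_hits_bad_line() and CACHELINE_SIZE
are hypothetical names, the 64-byte line size is an assumption, and it
assumes the address read from MCi_ADDR is valid down to cacheline
granularity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines; adjust for the actual CPU. */
#define CACHELINE_SIZE 64ULL

/*
 * Hypothetical helper: returns true if the physical range
 * [start, start + len) overlaps the cacheline containing mci_addr.
 * A dump tool could use this to leave that one line out of the image
 * instead of refusing to dump at all.
 * Note: does not guard against start + len overflowing 64 bits.
 */
static bool range_hits_bad_line(uint64_t start, uint64_t len,
                                uint64_t mci_addr)
{
    /* Round the reported address down to its cacheline boundary. */
    uint64_t bad = mci_addr & ~(CACHELINE_SIZE - 1);

    return len != 0 &&
           start < bad + CACHELINE_SIZE &&
           bad < start + len;
}
```

A dumper walking memory in chunks would call this per chunk and either
zero-fill or omit the overlapping 64 bytes.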

> It is best to avoid a normal kdump in both cases,
> although the elf-note doesn't distinguish between the two.
> 
> NT_NOCOREDUMP, in my opinion, is just the first step towards introducing
> a framework where different code paths that lead to panic() can
> 'opt-out' from kdump by adding an elf-note.
> 
> We can modify this to add more fine-grained messages using different elf-note
> types (or use the elf-note name under the NT_NOCOREDUMP type) to
> indicate the cause/type of crash.
> 
> I'd like to hear further from you and the rest of the community to see
> whether there is a need for such a change.

I'd make this conditional on whether memory has actually been corrupted:
evaluate the MCE signature and act accordingly.
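
As a rough sketch of what "acting accordingly" could mean, something
along these lines. The MCI_STATUS_* bit positions match
arch/x86/include/asm/mce.h; everything else (enum err_source, enum
dump_policy, mce_dump_policy()) is hypothetical and only for
illustration. In particular, the error source is assumed to come from a
vendor-specific decoder that is not shown, and falling back to "no dump"
for unknown sources is just one conservative choice.

```c
#include <stdint.h>

/* MCi_STATUS bits, as in arch/x86/include/asm/mce.h */
#define MCI_STATUS_VAL   (1ULL << 63)  /* valid error logged */
#define MCI_STATUS_UC    (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR contains an address */

/* Hypothetical: filled in by a vendor-specific signature decoder. */
enum err_source { ERR_SRC_DRAM, ERR_SRC_CACHE_FILL, ERR_SRC_UNKNOWN };

/* Hypothetical: what the crash path should tell kdump. */
enum dump_policy { DUMP_FULL, DUMP_SKIP_LINE, DUMP_NONE };

static enum dump_policy mce_dump_policy(uint64_t status,
                                        enum err_source src)
{
    /* No valid uncorrected error: nothing suggests memory is bad. */
    if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
        return DUMP_FULL;

    switch (src) {
    case ERR_SRC_CACHE_FILL:
        /* Corruption confined to the cache; the memory image is intact. */
        return DUMP_FULL;
    case ERR_SRC_DRAM:
        /* With a valid address, skip just that cacheline. */
        return (status & MCI_STATUS_ADDRV) ? DUMP_SKIP_LINE : DUMP_NONE;
    default:
        /* Assumption: be conservative when the source is unknown. */
        return DUMP_NONE;
    }
}
```

The point is that the "no coredump" decision becomes a function of the
signature rather than of the fact that we panicked.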

> > Btw, the doublefault example you give above - is this something you
> > experience on real hardware or just a theoretical thing?
> >
> 
> Unfortunately, I still haven't been able to try injecting memory errors
> and study the behaviour (I am trying to get access to a machine with
> appropriate firmware). I'll have a reply to this after some experiments
> with memory error injection.

Right, this might be much more helpful than theoretical discussions on
what to do. :-)

Thanks.

-- 
Regards/Gruss,
    Boris.