[PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

Sun Nov 16 21:22:01 PST 2014

From: Petr Tesarik <ptesarik at suse.cz>
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Fri, 14 Nov 2014 13:36:10 +0100

> On Fri, 14 Nov 2014 18:54:23 +0900 (JST)
> HATAYAMA Daisuke <d.hatayama at jp.fujitsu.com> wrote:
> 
>> From: Petr Tesarik <ptesarik at suse.cz>
>> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
>> Date: Fri, 14 Nov 2014 09:31:45 +0100
>> 
>> > On Fri, 14 Nov 2014 10:42:35 +0900 (JST)
>> > HATAYAMA Daisuke <d.hatayama at jp.fujitsu.com> wrote:
>> > 
>> >> From: Petr Tesarik <ptesarik at suse.cz>
>> >> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
>> >> Date: Thu, 13 Nov 2014 15:48:10 +0100
>> >> 
>> >> > On Thu, 13 Nov 2014 09:25:48 -0500
>> >> > Vivek Goyal <vgoyal at redhat.com> wrote:
>> >> > 
>> >> >> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
>> >> >> > 
>> >> >> > (2014/11/13 17:06), Petr Tesarik wrote:
>> >> >> > >On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
>> >> >> > >HATAYAMA Daisuke <d.hatayama at jp.fujitsu.com> wrote:
>> >> >> > >
>> >> >> > >>From: Vivek Goyal <vgoyal at redhat.com>
>> >> >> > >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
>> >> >> > >>Date: Wed, 12 Nov 2014 17:12:05 -0500
>> >> >> > >>
>> >> >> > >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
>> >> >> > >>>>Currently, VMCOREINFO note information reports the virtual address of
>> >> >> > >>>>phys_base that is assigned to symbol phys_base. But this doesn't make
>> >> >> > >>>>sense because to refer to value of the phys_base, it's necessary to
>> >> >> > >>>>get the value of phys_base itself we are now about to refer to.
>> >> >> > >>>>
>> >> >> > >>>
>> >> >> > >>>Hi Hatayama,
>> >> >> > >>>
>> >> >> > >>>/proc/vmcore ELF headers have virtual address information and using
>> >> >> > >>>that you should be able to read actual value of phys_base. gdb deals
>> >> >> > >>>with virtual addresses all the time and can read value of any symbol
>> >> >> > >>>using those headers.
>> >> >> > >>>
>> >> >> > >>>So I am not sure what's the need for exporting actual value of
>> >> >> > >>>phys_base.
>> >> >> > >>>
>> >> >> > >>
>> >> >> > >>Sorry, my logic in the patch description was wrong. For /proc/vmcore,
>> >> >> > >>there's enough information for makedumpfile to get phys_base. It's
>> >> >> > >>correct. The problem here is that other crash dump mechanisms that run
>> >> >> > >>outside Linux kernel independently don't have information to get
>> >> >> > >>phys_base.
>> >> >> > >
>> >> >> > >Yes, but these mechanisms won't be able to read VMCOREINFO either, will
>> >> >> > >they?
>> >> >> > >
>> >> >> > 
>> >> >> > I don't intend such sophisticated function only by VMCOREINFO.
>> >> >> > Search vmcore for VMCOREINFO using strings + grep before opening it by crash.
>> >> >> > I intend that only here.
>> >> >> 
>> >> >> I think this is very crude and not proper way to get to vmcoreinfo.
>> >> > 
>> >> > Same here. If VMCOREINFO must be locatable without communicating any
>> >> > information to the hypervisor, then I would rather go for something
>> >> > similar to what s390(x) folks do - a well-known location in physical
>> >> > memory that contains a pointer to a checksummed OS info structure,
>> >> > which in turn contains the VMCOREINFO pointers.
>> >> > 
>> >> > I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP.
>> >> > Or is that part of the current plan, Daisuke?
>> >> > 
>> >> 
>> >> It's useful if there is. I don't plan now. For now, the idea of this
>> >> patch is enough for me.
>> >> 
>> >> BTW, for the above idea, I suspect that if the location in the
>> >> physical memory is unique, it cannot deal with the kdump 2nd kernel
>> >> case.
>> > 
>> > No, not at all. The low 640K are copied away to a pre-allocated area by
>> > kexec purgatory code on x86_64, so it's safe to overwrite any location
>> > in there. The copy is needed, because BIOS already uses some hardcoded
>> > addresses in that range. I think the Linux kernel may safely use part of
>> > PFN 0 starting at physical address 0x0500. This area was originally
>> > used by MS-DOS, so chances are high that no broken BIOS out there
>> > corrupts this part of RAM...
>> > 
>> 
>> In fact, I didn't consider in such deep way... I had forgot back up
>> region at all. But it's hard to use the low 640K area. Then, it's hard
>> to get phys_base of the kdump 1st kernel that is assumed to be saved
>> in the low 640K now. Because externally running mechanism can run
>> after kdump 2nd kernel has booted up, crash utility needs to convert a
>> read request to the low 640K area into the corresponding part of the
>> pre-allocated area. See kdump_backup_region_init() in crash utility,
>> which tries to find the pre-allocated area via ELF header, where
>> symbol kexec_crash_image is read to find ELF header. This means we
>> need phys_base to find the pre-allocated area.
> 
> Wrong again, I'm afraid.
> 
> So, first of all, an admin should make up their mind whether to use
> kexec-based dumping, or stand-alone dumping. OK, you seem to address
> a corner case when s/he configures both. But in that case, the

It's never a corner case. We usually use both. The two differ in data
reliability: kdump can do cleanup at the kernel logic level at the end
of the kdump 1st kernel, before the kdump 2nd kernel starts. They also
differ in dumping features: kdump has makedumpfile, which can filter
memory to reduce the size of the crash dump. OTOH, an external dump can
still work even when kdump doesn't, but it may produce less reliable
data and has fewer features. So it's best to use both.

> stand-alone dump can be used to look at _BOTH_ kernels, and the default
> should indeed be the one that was currently running. After all, I have
> already debugged the _SECONDARY_ kernel environment several times...
> 
> However, it even works. If somebody wants to see the crashed kernel
> from the same dump, they can use the second kernel's internal
> structures to locate the corresponding phys_base and pass that as an
> option to crash.
> 
> Let me illustrate the situation:
> 
>   +-------------------+
>   | secondary kernel  | <--- low 640K
>   | private pointers -+--\
>   |                   |  |  (1)
>   |                   |  |
>   +-------------------+<-+-----\
>   |                   |  |     |
>   | primary kernel    |  |     |
>   Z                   Z  |     |
>   |                   |  |     |
>   +-------------------+<-/     |  (3)
>   | secondary kernel  |        |
>   | (contains pointer |        |
>   |  to backup area) -+--\     |
>   +-------------------+  | (2) |
>   | backup area       |<-/     |
>   |                  -+--------/
>   +-------------------+
>   |                   |
>   | 1st kernel again  |
>   Z                   Z
>   +-------------------+
> 
> The information is nicely chained in this diagram:
> 
>   (1)  Low 640K allows you to find the currently running kernel
>        (here it is the kdump kernel).
>   (2)  This kernel knows where to find the backup area (otherwise it
>        couldn't correctly map them in /proc/vmcore).
>   (3)  The backup area allows you to find the previously running
>        kernel (the 1st kernel).
> 
> I really don't see any issues with the concept, although I haven't
> tried it in practice (yet).
> 
> Petr T

I assume you don't intend to implement this logic in external crash
dump mechanisms such as qemu; it is too specific to the Linux kernel.

I still think the idea of my patch is simple and practical enough.
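
To make the intended use concrete, here is a rough sketch (not part of
the patch) of what a tool could do once the value of phys_base is in
VMCOREINFO: find the value in a raw dump image, the way strings + grep
would, and use it to translate a kernel text virtual address. The key
name "NUMBER(phys_base)", the helper names and the sample data are
assumptions made only for illustration; the translation itself is the
usual x86_64 one for the kernel text mapping, where __START_KERNEL_map
is 0xffffffff80000000.

```python
import re
from typing import Optional

# x86_64: base virtual address of the kernel text mapping (__START_KERNEL_map).
START_KERNEL_MAP = 0xFFFFFFFF80000000

# ASSUMPTION: the value is exported as a decimal "NUMBER(phys_base)=..." line.
# The key name is illustrative; the patch may use a different format.
PHYS_BASE_RE = re.compile(rb"^NUMBER\(phys_base\)=(-?\d+)$", re.MULTILINE)


def find_phys_base(blob: bytes) -> Optional[int]:
    """Scan raw dump bytes for the phys_base line, like strings + grep."""
    match = PHYS_BASE_RE.search(blob)
    return int(match.group(1)) if match else None


def kernel_text_vtop(vaddr: int, phys_base: int) -> int:
    """Translate a kernel text mapping virtual address to a physical one."""
    if vaddr < START_KERNEL_MAP:
        raise ValueError("address is below the kernel text mapping")
    return vaddr - START_KERNEL_MAP + phys_base


# Hypothetical dump fragment: binary noise around a VMCOREINFO-like text note.
SAMPLE = b"\x00\x7fjunk\nOSRELEASE=3.18.0\nNUMBER(phys_base)=16777216\n\x00"

if __name__ == "__main__":
    base = find_phys_base(SAMPLE)
    if base is not None:
        print(hex(kernel_text_vtop(0xFFFFFFFF81000000, base)))
```

With only the virtual address of the symbol in the note, the first step
above has nothing usable to find, because reading that address already
requires phys_base.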

--
Thanks.
HATAYAMA, Daisuke