[BUG REPORT] kexec and makedumpfile can't detect PAGE_OFFSET on arm (Wang Nan)

Dave Anderson anderson at redhat.com
Mon May 19 12:41:58 PDT 2014



----- Original Message -----
> 
> Hi Atsushi and Simon,
> 
> I found a problem related to VMSPLIT on the ARM platform, affecting
> kexec and makedumpfile.
> 
> When CONFIG_VMSPLIT_1G or CONFIG_VMSPLIT_2G is selected in the kernel,
> PAGE_OFFSET is actually 0x40000000 or 0x80000000 respectively. However,
> kexec hard-codes PAGE_OFFSET to 0xc0000000 (in
> kexec/arch/arm/crashdump-arm.h), which is incorrect in these
> configurations. For example, on a realview-pbx board with a 1G/3G
> VMSPLIT, the PHDRs in the generated /proc/vmcore are as follows:
> 
>   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
>   NOTE           0x001000 0x00000000 0x00000000 0x00690 0x00690     0
>   LOAD           0x002000 0xc0000000 0x00000000 0x10000000 0x10000000 RWE 0
>   LOAD           0x10002000 0xe0000000 0x20000000 0x8000000 0x8000000 RWE 0
>   LOAD           0x18002000 0xf0000000 0x30000000 0x10000000 0x10000000 RWE 0
>   LOAD           0x28002000 0x40000000 0x80000000 0x10000000 0x10000000 RWE 0
> 
> They should instead be:
> 
>   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
>   ...
>   LOAD            ...     0x40000000 0x00000000 0x10000000 0x10000000 RWE 0
>   LOAD            ...     0x60000000 0x20000000 0x8000000 0x8000000 RWE 0
>   LOAD            ...     0x70000000 0x30000000 0x10000000 0x10000000 RWE 0
>   LOAD            ...     0xc0000000 0x80000000 0x10000000 0x10000000 RWE 0
> 
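
To make the mismatch concrete, here is a minimal, self-contained sketch
(illustrative only; the helper name and the PHYS_OFFSET value are
assumptions, not the actual kexec source) of how a hard-coded PAGE_OFFSET
produces the wrong p_vaddr values for the physical ranges above:

  /* Illustrative sketch, not kexec source: virt = phys - PHYS_OFFSET +
   * PAGE_OFFSET, with PHYS_OFFSET assumed to be 0 on this board. */
  #include <stdio.h>

  #define HARDCODED_PAGE_OFFSET 0xc0000000u  /* kexec/arch/arm/crashdump-arm.h */
  #define ACTUAL_PAGE_OFFSET    0x40000000u  /* CONFIG_VMSPLIT_1G */
  #define PHYS_OFFSET           0x00000000u  /* assumed RAM base */

  /* 32-bit arithmetic, so 0x80000000 wraps to 0x40000000 with the
   * hard-coded offset, exactly as in the broken table above. */
  static unsigned lowmem_virt(unsigned paddr, unsigned page_offset)
  {
          return paddr - PHYS_OFFSET + page_offset;
  }

  int main(void)
  {
          const unsigned paddr[] = { 0x00000000u, 0x20000000u,
                                     0x30000000u, 0x80000000u };

          for (int i = 0; i < 4; i++)
                  printf("PhysAddr 0x%08x -> p_vaddr 0x%08x (hard-coded) "
                         "vs 0x%08x (1G/3G split)\n", paddr[i],
                         lowmem_virt(paddr[i], HARDCODED_PAGE_OFFSET),
                         lowmem_virt(paddr[i], ACTUAL_PAGE_OFFSET));
          return 0;
  }
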
> I don't know why the crash utility can deal with it without problems,

For ARM, the crash utility masks the symbol value of "_stext" with 0x1fffffff
to determine the PAGE_OFFSET value; this approach was basically copied from
the way it was done for i386.
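
A minimal sketch of that heuristic (hypothetical _stext values, not the
actual crash source): masking off the low 29 bits rounds _stext down to a
512MB boundary, which recovers PAGE_OFFSET for any of the common VMSPLIT
choices.

  /* Sketch of the heuristic described above, not the crash source. */
  #include <stdio.h>

  static unsigned derive_page_offset(unsigned stext)
  {
          return stext & ~0x1fffffffu;   /* round down to a 512MB boundary */
  }

  int main(void)
  {
          const unsigned stext[] = { 0xc0008240u,   /* 3G/1G split */
                                     0x80008240u,   /* 2G/2G split */
                                     0x40008240u }; /* 1G/3G split */

          for (int i = 0; i < 3; i++)
                  printf("_stext=0x%08x -> PAGE_OFFSET=0x%08x\n",
                         stext[i], derive_page_offset(stext[i]));
          return 0;
  }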

> but in makedumpfile such a VMSPLIT setting causes a segfault:
> 
>  $ ./makedumpfile -c -d 31 /proc/vmcore ./out -f
>  The kernel version is not supported.
>  The created dumpfile may be incomplete.
>  Excluding unnecessary pages        : [  0.0 %] /Segmentation fault
> 
> There are several ways to deal with it; I want to discuss them on the
> mailing list and reach a decision:
> 
>  1. Change kexec to detect PAGE_OFFSET dynamically. However, I don't
>     know whether there is a reliable way to do this; one possibility is
>     for the kernel to export PAGE_OFFSET through sysfs, e.g.
>     /sys/kernel/page_offset.
> 
>  2. Or, have kexec accept PAGE_OFFSET as a command line argument,
>     letting the user provide the correct value.
> 
>  3. Or, change makedumpfile so that it no longer trusts the EHDR; the
>     kernel should export PAGE_OFFSET through VMCOREINFO.
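
A minimal kernel-side sketch of what option 3 could look like, assuming the
generic VMCOREINFO_NUMBER() helper and the per-arch
arch_crash_save_vmcoreinfo() hook; an illustration, not a proposed patch:

  /* Hypothetical sketch for option 3: emit PAGE_OFFSET into the vmcoreinfo
   * note so makedumpfile no longer has to trust the ELF headers. */
  #include <linux/kexec.h>
  #include <asm/memory.h>

  void arch_crash_save_vmcoreinfo(void)
  {
          /* appends "NUMBER(PAGE_OFFSET)=<value>" to the vmcoreinfo note */
          VMCOREINFO_NUMBER(PAGE_OFFSET);
  }

makedumpfile could then read NUMBER(PAGE_OFFSET) from the note instead of
inferring the value from the LOAD headers.
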
> 
> What do you think?
> 
> Thank you!
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 19 May 2014 11:11:40 -0400
> From: Vivek Goyal <vgoyal at redhat.com>
> To: "bhe at redhat.com" <bhe at redhat.com>
> Cc: "kexec at lists.infradead.org" <kexec at lists.infradead.org>,
> 	"d.hatayama at jp.fujitsu.com" <d.hatayama at jp.fujitsu.com>, Atsushi
> 	Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>, "zzou at redhat.com"
> 	<zzou at redhat.com>, Larry Woodman <lwoodman at redhat.com>
> Subject: Re: [PATCH] makedumpfile: change the wrong code to calculate
> 	bufsize_cyclic for elf dump
> Message-ID: <20140519151140.GF650 at redhat.com>
> Content-Type: text/plain; charset=us-ascii
> 
> On Mon, May 19, 2014 at 07:15:38PM +0800, bhe at redhat.com wrote:
> 
> [..]
> > -------------------------------------------------
> > bhe# cat /etc/kdump.conf
> > path /var/crash
> > core_collector makedumpfile -E --message-level 1 -d 31
> > 
> > ------------------------------------------
> > kdump: dump target is /dev/sda2
> > kdump: saving [    9.595153] EXT4-fs (sda2): re-mounted. Opts:
> > data=ordered
> > to /sysroot//var/crash/127.0.0.1-2014.05.19-18:50:18/
> > kdump: saving vmcore-dmesg.txt
> > kdump: saving vmcore-dmesg.txt complete
> > kdump: saving vmcore
> > 
> > calculate_cyclic_buffer_size, get_free_memory_size: 68857856
> > 
> >  Buffer size for the cyclic mode: 27543142
> 
> Bao,
> 
> So 68857856 bytes is about 65MB; we had around 65MB free when
> makedumpfile started.
> 
> 27543142 bytes is about 26MB. So did we reserve 26MB for bitmaps, or
> 52MB?
> 
> Looking at the backtrace, Larry pointed out a few things.
> 
> - makedumpfile has already allocated around 52MB of anonymous memory. I
>   guess this primarily comes from the bitmaps, so it looks like we are
>   reserving 52MB for bitmaps and not 26MB. That would be consistent with
>   the current 80% logic, as 80% of 65MB is around 52MB.
> 
> 	[   15.427173] Killed process 286 (makedumpfile) total-vm:79940kB,
> 			anon-rss:54132kB, file-rss:892kB
> 
> - So we are left with 65 - 52 = 13MB of total memory for the kernel as
>   well as makedumpfile.
>  
> - We have around 1500 pages in the page cache which are in the
>   writeback stage. That means around 6MB of pages are dirty and being
>   written back to disk. So makedumpfile itself might not require a lot
>   of memory, but the kernel does need free memory for dirty/writeback
>   pages while the dump file is being written.
> 
> 	[   15.167732]  unevictable:7137 dirty:2 writeback:1511 unstable:0
> 
> - Larry mentioned that there are around 5000 pages (20MB of memory)
>   sitting as file pages in the page cache which should ideally be
>   reclaimable. It is not clear why that memory is not being reclaimed
>   fast enough.
> 
> 	[   15.167732]  active_file:2406 inactive_file:2533 isolated_file:0
> 
> So to me the bottom line is that once the write-out starts, the kernel
> needs memory for holding dirty and writeback pages in the cache too. We
> are probably being too aggressive in allocating 80% of free memory for
> bitmaps; maybe we should drop it down to 50-60% of free memory.
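
A back-of-the-envelope check of those numbers (a sketch of the arithmetic
only, assuming the 80% of free memory is split evenly across the two
bitmaps; not the makedumpfile source):

  /* Sketch of the arithmetic discussed above, not makedumpfile code. */
  #include <stdio.h>

  int main(void)
  {
          unsigned long free_bytes = 68857856UL;  /* get_free_memory_size() */
          double ratio = 0.8;                     /* current policy: 80% of free */
          unsigned long per_bitmap = (unsigned long)(free_bytes * ratio / 2);

          printf("per-bitmap buffer: %lu bytes (~%lu MB)\n",
                 per_bitmap, per_bitmap >> 20);
          printf("both bitmaps:      %lu bytes (~%lu MB)\n",
                 per_bitmap * 2, (per_bitmap * 2) >> 20);
          /* per_bitmap comes out to 27543142, matching the reported "Buffer
           * size for the cyclic mode"; the two bitmaps together take ~52MB
           * of the ~65MB that was free. */
          return 0;
  }
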
> 
> Thanks
> Vivek
> 
> 
> 
> 
> > Copying data                       : [ 15.9 %] -[   14.955468]
> > makedumpfile invoked oom-killer: gfp_mask=0x10200da, order=0,
> > oom_score_adj=0
> > [   14.963876] makedumpfile cpuset=/ mems_allowed=0
> > [   14.968723] CPU: 0 PID: 286 Comm: makedumpfile Not tainted
> > 3.10.0-123.el7.x86_64 #1
> > [   14.976606] Hardware name: Hewlett-Packard HP Z420 Workstation/1589,
> > BIOS J61 v01.02 03/09/2012
> > [   14.985567]  ffff88002fedc440 00000000f650c592 ffff88002fcb57d0
> > ffffffff815e19ba
> > [   14.993291]  ffff88002fcb5860 ffffffff815dd02d ffffffff810b68f8
> > ffff8800359dc0c0
> > [   15.001013]  ffffffff00000206 ffffffff00000000 0000000000000000
> > ffffffff81102e03
> > [   15.008733] Call Trace:
> > [   15.011413]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
> > [   15.016778]  [<ffffffff815dd02d>] dump_header+0x8e/0x214
> > [   15.022321]  [<ffffffff810b68f8>] ? ktime_get_ts+0x48/0xe0
> > [   15.028036]  [<ffffffff81102e03>] ? proc_do_uts_string+0xe3/0x130
> > [   15.034383]  [<ffffffff8114520e>] oom_kill_process+0x24e/0x3b0
> > [   15.040446]  [<ffffffff8106af3e>] ? has_capability_noaudit+0x1e/0x30
> > [   15.047068]  [<ffffffff81145a36>] out_of_memory+0x4b6/0x4f0
> > [   15.052864]  [<ffffffff8114b579>] __alloc_pages_nodemask+0xa09/0xb10
> > [   15.059482]  [<ffffffff81188779>] alloc_pages_current+0xa9/0x170
> > [   15.065711]  [<ffffffff811419f7>] __page_cache_alloc+0x87/0xb0
> > [   15.071804]  [<ffffffff81142606>]
> > grab_cache_page_write_begin+0x76/0xd0
> > [   15.078646]  [<ffffffffa02aa133>] ext4_da_write_begin+0xa3/0x330
> > [ext4]
> > [   15.085495]  [<ffffffff8114162e>]
> > generic_file_buffered_write+0x11e/0x290
> > [   15.092504]  [<ffffffff81143785>]
> > __generic_file_aio_write+0x1d5/0x3e0
> > [   15.099294]  [<ffffffff81050f00>] ?
> > rbt_memtype_copy_nth_element+0xa0/0xa0
> > [   15.106385]  [<ffffffff811439ed>] generic_file_aio_write+0x5d/0xc0
> > [   15.112841]  [<ffffffffa02a0189>] ext4_file_write+0xa9/0x450 [ext4]
> > [   15.119321]  [<ffffffff8117997c>] ? free_vmap_area_noflush+0x7c/0x90
> > [   15.125884]  [<ffffffff811af36d>] do_sync_write+0x8d/0xd0
> > [   15.131492]  [<ffffffff811afb0d>] vfs_write+0xbd/0x1e0
> > [   15.136839]  [<ffffffff811b0558>] SyS_write+0x58/0xb0
> > [   15.142091]  [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
> > [   15.148293] Mem-Info:
> > [   15.150770] Node 0 DMA per-cpu:
> > [   15.154138] CPU    0: hi:    0, btch:   1 usd:   0
> > [   15.159133] Node 0 DMA32 per-cpu:
> > [   15.162741] CPU    0: hi:   42, btch:   7 usd:  12
> > [   15.167732] active_anon:14395 inactive_anon:1034 isolated_anon:0
> > [   15.167732]  active_file:2406 inactive_file:2533 isolated_file:0
> > [   15.167732]  unevictable:7137 dirty:2 writeback:1511 unstable:0
> > [   15.167732]  free:488 slab_reclaimable:2371 slab_unreclaimable:3533
> > [   15.167732]  mapped:1110 shmem:1065 pagetables:166 bounce:0
> > [   15.167732]  free_cma:0
> > [   15.203076] Node 0 DMA free:508kB min:4kB low:4kB high:4kB
> > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> > unevictabs
> > [   15.242882] lowmem_reserve[]: 0 128 128 128
> > [   15.247447] Node 0 DMA32 free:1444kB min:1444kB low:1804kB
> > high:2164kB active_anon:57580kB inactive_anon:4136kB active_file:9624kB
> > inacts
> > [   15.292683] lowmem_reserve[]: 0 0 0 0
> > [   15.296761] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U)
> > 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB B
> > [   15.310372] Node 0 DMA32: 78*4kB (UEM) 52*8kB (UEM) 17*16kB (UM)
> > 12*32kB (UM) 2*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*40B
> > [   15.324412] Node 0 hugepages_total=0 hugepages_free=0
> > hugepages_surp=0 hugepages_size=2048kB
> > [   15.333088] 13144 total pagecache pages
> > [   15.337161] 0 pages in swap cache
> > [   15.340708] Swap cache stats: add 0, delete 0, find 0/0
> > [   15.346165] Free swap  = 0kB
> > [   15.349280] Total swap = 0kB
> > [   15.353385] 90211 pages RAM
> > [   15.356420] 53902 pages reserved
> > [   15.359880] 6980 pages shared
> > [   15.363088] 29182 pages non-shared
> > [   15.366719] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents
> > oom_score_adj name
> > [   15.374788] [   85]     0    85    13020      553      24        0
> > 0 systemd-journal
> > [   15.383818] [  134]     0   134     8860      547      22        0
> > -1000 systemd-udevd
> > [   15.392664] [  146]     0   146     5551      245      23        0
> > 0 plymouthd
> > [   15.401167] [  230]     0   230     3106      537      16        0
> > 0 dracut-pre-pivo
> > [   15.410181] [  286]     0   286    19985    13756      55        0
> > 0 makedumpfile
> > [   15.418942] Out of memory: Kill process 286 (makedumpfile) score 368
> > or sacrifice child
> > [   15.427173] Killed process 286 (makedumpfile) total-vm:79940kB,
> > anon-rss:54132kB, file-rss:892kB
> > //lib/dracut/hooks/pre-pivot/9999-kdump.sh: line
> > Generating "/run/initramfs/rdsosreport.txt"
> > 
> > > 
> > > 
> > > Thanks
> > > Atsushi Kumagai
> > > 
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Mon, 19 May 2014 17:09:48 +0100
> From: Will Deacon <will.deacon at arm.com>
> To: Wang Nan <wangnan0 at huawei.com>
> Cc: "linux at arm.linux.org.uk" <linux at arm.linux.org.uk>,
> 	"kexec at lists.infradead.org" <kexec at lists.infradead.org>, Geng Hui
> 	<hui.geng at huawei.com>, Simon Horman <horms at verge.net.au>, Andrew
> 	Morton <akpm at linux-foundation.org>,
> 	"linux-arm-kernel at lists.infradead.org"
> 	<linux-arm-kernel at lists.infradead.org>
> Subject: Re: [PATCH Resend] ARM: kdump: makes second kernel use strict
> 	pfn_valid
> Message-ID: <20140519160947.GM15130 at arm.com>
> Content-Type: text/plain; charset=us-ascii
> 
> On Mon, May 19, 2014 at 02:54:03AM +0100, Wang Nan wrote:
> > When SPARSEMEM and CRASH_DUMP are both selected, the simple pfn_valid()
> > prevents the second kernel from ioremapping the first kernel's memory
> > if the address falls into the second kernel's section. This limitation
> > requires that the second kernel occupy a full section and that
> > elfcorehdr reside in another section.
> > 
> > This patch makes the crash dump kernel use strict pfn_valid(), removing
> > that limitation.
> > 
> > For example:
> > 
> >   For a platform with SECTION_SIZE_BITS == 28 (256MiB) and
> >   crashkernel=128M@0x28000000 on the kernel cmdline, the second
> >   kernel is loaded at 0x28000000. Kexec puts elfcorehdr at
> >   0x2ff00000 and passes 'elfcorehdr=0x2ff00000 mem=130048K' to the
> >   second kernel. When the second kernel starts, it tries to use
> >   ioremap to retrieve its elfcorehdr. In this case elfcorehdr is in
> >   the same section as the second kernel, so pfn_valid() recognizes
> >   the page as valid and ioremap refuses to map it.
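
A small sketch of the section arithmetic in that example (not kernel code;
the SECTION_NR macro below is just for illustration):

  /* With SECTION_SIZE_BITS == 28, the crash kernel region and elfcorehdr
   * land in the same 256MiB section, so a section-granular pfn_valid()
   * reports the elfcorehdr page as RAM and ARM's ioremap() refuses it. */
  #include <stdio.h>

  #define SECTION_SIZE_BITS 28
  #define SECTION_NR(addr)  ((addr) >> SECTION_SIZE_BITS)

  int main(void)
  {
          unsigned long crashk_base = 0x28000000UL; /* crashkernel=128M@0x28000000 */
          unsigned long elfcorehdr  = 0x2ff00000UL; /* passed via elfcorehdr= */

          printf("crash kernel section: %lu\n", SECTION_NR(crashk_base)); /* 2 */
          printf("elfcorehdr section:   %lu\n", SECTION_NR(elfcorehdr));  /* 2 */
          /* Same section, yet 0x2ff00000 lies just past the mem=130048K
           * range given to the second kernel, hence the switch to a strict
           * (memblock-based) pfn_valid(). */
          return 0;
  }
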
> 
> So isn't the issue here that you're passing an incorrect mem= parameter
> to the crash kernel?
> 
> Will
> 
> 
> 


