/proc/vmcore mmap() failure issue
HATAYAMA Daisuke
d.hatayama at jp.fujitsu.com
Thu Nov 14 05:31:37 EST 2013
(2013/11/14 5:41), Vivek Goyal wrote:
> Hi Hatayama,
>
> We are facing some /proc/vmcore mmap() failures, after which makedumpfile
> exits without saving the dump and the system reboots.
>
> I tried latest makedumpfile (devel branch) with 3.12 kernel.
>
> I think this issue happens only on some machines. It looks like it
> happens when the end of a system RAM chunk in the first kernel is not page
> aligned. For example, I have one machine where I noticed it, and this is
> what its system RAM looks like.
>
> 00100000-dafa57ff : System RAM
>   01000000-015892fa : Kernel code
>   015892fb-0195c9ff : Kernel data
>   01ae6000-01d31fff : Kernel bss
>   24000000-33ffffff : Crash kernel
> dafa5800-dbffffff : reserved
>
> Notice that dafa57ff does not end at page boundary and next reserved
> range does not start at page boundary. I think that next reserved
> range is referenced through some ACPI data. More on this later.
>
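The alignment observation above can be checked with a little arithmetic. The following is only a minimal sketch, not kernel or makedumpfile code: it assumes 4 KiB pages, and the struct and helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL   /* assumption: 4 KiB pages */

/* Hypothetical type: an inclusive [start, end] range as shown in /proc/iomem. */
struct iomem_range {
    uint64_t start;
    uint64_t end;           /* address of the last byte in the range */
};

/* True when the first byte of the range sits on a page boundary. */
static inline bool start_is_page_aligned(struct iomem_range r)
{
    return (r.start % PAGE_SIZE) == 0;
}

/* True when the byte following the range sits on a page boundary. */
static inline bool end_is_page_aligned(struct iomem_range r)
{
    assert(r.end >= r.start);
    return ((r.end + 1) % PAGE_SIZE) == 0;
}

/* Physical address of the page that contains the last byte of the range. */
static inline uint64_t last_page_base(struct iomem_range r)
{
    return r.end & ~(PAGE_SIZE - 1);
}
```

With the values from the listing, the System RAM range {0x00100000, 0xdafa57ff} has an aligned start but an unaligned end, and its last page begins at 0xdafa5000, the same page in which the reserved range starting at 0xdafa5800 begins.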
> So we put in some printk() messages to get more info. In a nutshell,
> remap_pfn_range() fails when we try to map the last section of system
> RAM that does not end on a page boundary.
>
> remap_pfn_range()
>   track_pfn_remap() {
>       /*
>        * For anything smaller than the vma size we set prot based on the
>        * lookup.
>        */
>       flags = lookup_memtype(paddr);
>
>       /* Check memtype for the remaining pages */
>       while (size > PAGE_SIZE) {
>           size -= PAGE_SIZE;
>           paddr += PAGE_SIZE;
>           if (flags != lookup_memtype(paddr))
>               return -EINVAL; <---------------- Failure.
>       }
>
>   }
>
>
> So we pass in a range to track_pfn_remap(), say pfn=0xdad62 size=0x244000.
> Now we call lookup_memtype() on every page in the range and make sure
> they are all the same, otherwise we fail. Guess what, they are all the same
> except the last page (where RAM does not end at a page boundary).
>
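The failing walk can be modelled in user space. This is a rough sketch under stated assumptions, not the kernel implementation: the model_* functions and the enum are hypothetical stand-ins, 4 KiB pages are assumed, and the only registered page is the one at 0xdafa5000 reported in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Stand-ins for the kernel cache attributes; the values are arbitrary. */
enum memtype { MEMTYPE_UC_MINUS, MEMTYPE_WB };

/* The single page assumed to be registered (as WB) in the memtype tree. */
#define REGISTERED_WB_PAGE 0xdafa5000ULL

/* Simplified model of lookup_memtype(): unregistered pages are UC_MINUS. */
static inline enum memtype model_lookup_memtype(uint64_t paddr)
{
    if ((paddr & ~(PAGE_SIZE - 1)) == REGISTERED_WB_PAGE)
        return MEMTYPE_WB;
    return MEMTYPE_UC_MINUS;
}

/* Same loop shape as the quoted snippet; returns 0 or -EINVAL. */
static inline int model_track_pfn_remap(uint64_t pfn, uint64_t size)
{
    uint64_t paddr = pfn << PAGE_SHIFT;
    enum memtype flags = model_lookup_memtype(paddr);

    assert(size % PAGE_SIZE == 0);

    /* Every remaining page must have the same memtype as the first one. */
    while (size > PAGE_SIZE) {
        size -= PAGE_SIZE;
        paddr += PAGE_SIZE;
        if (flags != model_lookup_memtype(paddr))
            return -EINVAL;
    }
    return 0;
}
```

In this model, model_track_pfn_remap(0xdad62, 0x244000) returns -EINVAL because the range ends at 0xdafa6000 and so includes the page at 0xdafa5000, while the same call with size 0x243000 stops one page earlier and returns 0.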
> I dug deeper into lookup_memtype() and noticed that none of the regular
> ranges are registered anywhere and their flags are _PAGE_CACHE_UC_MINUS.
> But the last unaligned page/range is registered in the memtype rb tree and
> has the attribute _PAGE_CACHE_WB.
>
> Then I hooked into reserve_memtype() to figure out who is registering
> page 0xdafa5000, and it is acpi_init() that does it.
>
> [ 0.721655] Hardware name: <edited>
> [ 0.730590] ffff8800340f3830 ffff8800340f37c0 ffffffff81575509 00000000dafa5000
> [ 0.738010] ffff8800340f3800 ffffffff810566cc 00000000000dafa5 00000000dafa5000
> [ 0.745428] 00000000dafa6000 00000000dafa5000 0000000000000000 0000000000001000
> [ 0.752845] Call Trace:
> [ 0.755288] [<ffffffff81575509>] dump_stack+0x45/0x56
> [ 0.760414] [<ffffffff810566cc>] reserve_memtype+0x31c/0x3f0
> [ 0.766144] [<ffffffff810537ef>] __ioremap_caller+0x12f/0x360
> [ 0.771963] [<ffffffff8130ad56>] ? acpi_os_release_object+0xe/0x12
> [ 0.778217] [<ffffffff815686ba>] ? acpi_os_map_memory+0xf6/0x14e
> [ 0.784295] [<ffffffff81053a54>] ioremap_cache+0x14/0x20
> [ 0.789679] [<ffffffff815686ba>] acpi_os_map_memory+0xf6/0x14e
> [ 0.795582] [<ffffffff81322ac9>] acpi_ex_system_memory_space_handler+0xdd/0x1ca
> [ 0.802961] [<ffffffff8131ca48>] acpi_ev_address_space_dispatch+0x1b0/0x208
> [ 0.809993] [<ffffffff8131fd49>] acpi_ex_access_region+0x20e/0x2a2
> [ 0.816244] [<ffffffff81149464>] ? __alloc_pages_nodemask+0x134/0x300
> [ 0.822754] [<ffffffff813200e4>] acpi_ex_field_datum_io+0xf6/0x171
> [ 0.829004] [<ffffffff81320301>] acpi_ex_extract_from_field+0xd7/0x20a
> [ 0.835602] [<ffffffff81331d80>] ? acpi_ut_create_internal_object_dbg+0x23/0x8a
> [ 0.842981] [<ffffffff8131f8e7>] acpi_ex_read_data_from_field+0x10f/0x14b
> [ 0.849838] [<ffffffff81322e16>] acpi_ex_resolve_node_to_value+0x18e/0x21c
> [ 0.856780] [<ffffffff813230a6>] acpi_ex_resolve_to_value+0x202/0x209
> [ 0.863291] [<ffffffff81319486>] acpi_ds_evaluate_name_path+0x7b/0xf5
> [ 0.869803] [<ffffffff81319834>] acpi_ds_exec_end_op+0x98/0x3e8
> [ 0.875793] [<ffffffff8132aca4>] acpi_ps_parse_loop+0x514/0x560
> [ 0.881784] [<ffffffff8132b738>] acpi_ps_parse_aml+0x98/0x28c
> [ 0.887601] [<ffffffff8132bf8d>] acpi_ps_execute_method+0x1c1/0x26c
> [ 0.893939] [<ffffffff813269c5>] acpi_ns_evaluate+0x1c1/0x258
> [ 0.899755] [<ffffffff8131cb98>] acpi_ev_execute_reg_method+0xca/0x112
> [ 0.906353] [<ffffffff8131cd6e>] acpi_ev_reg_run+0x48/0x52
> [ 0.911910] [<ffffffff81328fad>] acpi_ns_walk_namespace+0xc8/0x17f
> [ 0.918160] [<ffffffff8131cd26>] ? acpi_ev_detach_region+0x146/0x146
> [ 0.924585] [<ffffffff8131cdbc>] acpi_ev_execute_reg_methods+0x44/0xf7
> [ 0.931184] [<ffffffff819b2324>] ? acpi_sleep_proc_init+0x2a/0x2a
> [ 0.937349] [<ffffffff8130ac66>] ? acpi_os_wait_semaphore+0x43/0x57
> [ 0.943686] [<ffffffff81331a3f>] ? acpi_ut_acquire_mutex+0x48/0x88
> [ 0.949938] [<ffffffff8131ceb8>] acpi_ev_initialize_op_regions+0x49/0x71
> [ 0.956709] [<ffffffff819b2324>] ? acpi_sleep_proc_init+0x2a/0x2a
> [ 0.962873] [<ffffffff81333310>] acpi_initialize_objects+0x23/0x4f
> [ 0.969125] [<ffffffff819b23b4>] acpi_init+0x90/0x268
>
> So basically, this split page seems to be the problem. Some other code
> thinks that it has access to the full page, goes ahead and registers
> it with the PAT rb tree, and this causes problems in the mmap() code.
>
> I suspect we might have to go back to the idea of copying the first and
> last non-page-aligned ranges into the new kernel's memory and reading them
> from there to solve this issue. Do you have other ideas?
>
Sorry for the delayed response, although it looks like you have already found
a way to fix this issue.

BTW, I previously found a part of makedumpfile that truncated the first and
last pages if they were not page aligned. According to Kumagai-san, this
truncation happened on some ia64 systems and he found valid data in the
truncated area, so the latest makedumpfile no longer does such truncation.
The commit is:
commit f854b37adba223d5b4801accbedd17b447266d51
Author: Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>
Date: Fri Jun 21 15:25:31 2013 +0900

    [PATCH 2/2] Fix the handling of the pages correspond to border of PT_LOAD.

    The pages correspond to border of PT_LOAD were removed as holes.
    For example, pfn:N showed below was removed but we know even
    odd region like [0x40ffda7000 - 0x40ffda8000] can include valid
    dates, so we shouldn't remove it as holes.

                           phys_start
                           = 0x40ffda7000
            |<-- frac_head -->|------------- PT_LOAD -------------
        ----+-----------------------+---------------------+----
            |         pfn:N         |       pfn:N+1       | ...
        ----+-----------------------+---------------------+----
            |
            pfn_to_paddr(pfn:N)               # page size = 16k
            = 0x40ffda4000

    This patch handles such odd regions correctly. Then read pfn:N
    and write it to disk, the ranges not covered by any PT_LOAD
    entries will be filled with 0.

    Signed-off-by: Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>
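The addresses in the commit message follow from simple page arithmetic. As a small sketch only, assuming the 16 KiB page size given in the diagram (the ex_* helpers below are hypothetical names for illustration, not makedumpfile's own macros):

```c
#include <stdint.h>

/* 16 KiB pages, as in the commit message example ("page size = 16k"). */
#define EX_PAGE_SHIFT 14
#define EX_PAGE_SIZE  (1ULL << EX_PAGE_SHIFT)

/* Page frame number that contains a physical address. */
static inline uint64_t ex_paddr_to_pfn(uint64_t paddr)
{
    return paddr >> EX_PAGE_SHIFT;
}

/* Physical address at which a page frame starts. */
static inline uint64_t ex_pfn_to_paddr(uint64_t pfn)
{
    return pfn << EX_PAGE_SHIFT;
}

/*
 * Number of bytes between the start of the page and phys_start, i.e. the
 * part of pfn:N that is not covered by the PT_LOAD entry ("frac_head").
 */
static inline uint64_t ex_frac_head(uint64_t phys_start)
{
    return phys_start - ex_pfn_to_paddr(ex_paddr_to_pfn(phys_start));
}
```

For phys_start = 0x40ffda7000, the containing page starts at 0x40ffda4000 and ex_frac_head() gives 0x3000, so only the last 0x1000 bytes of pfn:N, the region [0x40ffda7000 - 0x40ffda8000], belong to the PT_LOAD entry.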
The log on the web is:
http://lists.infradead.org/pipermail/kexec/2013-May/008875.html

So, without this change, you would not have seen this issue. The original
code may have been written that way because of issues similar to this one.

Next, I think we need to consider whether to revert the above commit, since
makedumpfile now fails on some systems, as you reported.

--
Thanks.
HATAYAMA, Daisuke