Question about Address Range Validation in Crash Kernel Allocation
Li Huafei
lihuafei1 at huawei.com
Thu Mar 21 05:37:05 PDT 2024
On 2024/3/21 18:06, Dave Young wrote:
> Hi,
>
> On Thu, 21 Mar 2024 at 17:49, Li Huafei <lihuafei1 at huawei.com> wrote:
>>
>> Hi Baoquan,
>>
>> On 2024/3/21 17:17, chenhaixiang (A) wrote:
>>>
>>>>> I'm sorry for the delay. Here are some details from the boot log and
>>>> /proc/iomem:
>>>>> The Boot log:
>>>>> [ 0.000000] Linux version 6.8.0 (root at localhost.localdomain) (gcc (GCC)
>>>> 10.3.1, GNU ld (GNU Binutils) 2.37) #3 SMP PREEMPT_DYNAMIC Wed Mar 20
>>>> 11:46:11 UTC 2024
>>>>> [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0
>>>> root=/dev/mapper/root ro crashkernel=512M resume=/dev/mapper/swap
>>>> rd.lvm.lv=root rd.lvm.lv=swap crash_kexec_post_notifiers softlockup_panic=1
>>>> reserve_kbox_mem=16M fsck.mode=auto fsck.repair=yes panic=3
>>>> nmi_watchdog=1 quiet rd.shell=0 memblock=debug efi=debug
>>>> console=ttyS0,115200n8 console=tty0
>>>> ......snip...
>>>>> [ 0.022622] memblock_phys_alloc_range: 536870912 bytes align=0x1000000
>>>> from=0x0000000000000000 max_addr=0x0000000100000000
>>>> reserve_crashkernel_generic+0x7c/0x220
>>>>> [ 0.022628] memblock_phys_alloc_range: 536870912 bytes align=0x1000000
>>>> from=0x0000000100000000 max_addr=0x0000400000000000
>>>> reserve_crashkernel_generic+0x7c/0x220
>>>>> [ 0.022632] memblock_reserve: [0x000000c01f000000-0x000000c03effffff]
>>>> memblock_alloc_range_nid+0xee/0x170
>>>>> [ 0.022634] memblock_phys_alloc_range: 268435456 bytes align=0x1000000
>>>> from=0x0000000000000000 max_addr=0x0000000100000000
>>>> reserve_crashkernel_generic+0x11d/0x220
>>>>> [ 0.022638] memblock_reserve: [0x0000000049000000-0x0000000058ffffff]
>>>> memblock_alloc_range_nid+0xee/0x170
>>>>> [ 0.022640] crashkernel low memory reserved: 0x49000000 - 0x59000000
>>>> (256 MB)
>>>>> [ 0.022641] crashkernel reserved: 0x000000c01f000000 -
>>>> 0x000000c03f000000 (512 MB)
>>>>
>>>> Here, crashkernel,low is reserved in region: [0x49000000 - 0x59000000] (256
>>>> MB)
>>>> crashkernel,high is reserved in region: [0x000000c01f000000 -
>>>> 0x000000c03f000000] (512 MB) ......
>>>>> [ 0.029839] memblock_reserve: [0x000000c03ffff740-0x000000c03fffff7f]
>>>> memblock_alloc_range_nid+0xee/0x170
>>>>> [ 0.029843] e820: update [mem 0x53cbd000-0x53ccffff] usable ==>
>>>> reserved
>>>>> [ 0.029861] TSC deadline timer available
>>>>
>>>> Then here, region [0x53cbd000-0x53ccffff] is reserved in e820, and print above
>>>> "usable ==> reserved". This should be the step which prevents earlier reserved
>>>> crashkernel,low from being added to iomem tree. I am not sure what triggered
>>>> the e820 update.
>>
>> We added dump_stack() printing in efi_mem_reserve() and found that
>> [0x53cbd000-0x53ccffff] was reserved by BGRT:
>>
>> [ 0.032259] e820: update [mem 0x53cbd000-0x53ccffff] usable ==>
>> reserved
>> [ 0.032262] CPU: 0 PID: 0 Comm: swapper Not tainted
>> 5.10.0-60.18.0.50.h820.eulerosv2r11.x86_64 #7
>> [ 0.032263] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 8.25
>> 08/30/2022
>> [ 0.032264] Call Trace:
>> [ 0.032265] ? dump_stack+0x57/0x6e
>> [ 0.032267] ? bgrt_init+0xc2/0xc2
>> [ 0.032268] ? __e820__range_update+0x7a/0x1d6
>> [ 0.032270] ? bgrt_init+0xc2/0xc2
>> [ 0.032272] ? bgrt_init+0xc2/0xc2
>> [ 0.032274] ? efi_arch_mem_reserve+0x1a3/0x1d0
>> [ 0.032276] ? efi_mem_reserve+0x2d/0x42
>> [ 0.032278] ? acpi_parse_bgrt+0xa/0x11
>> [ 0.032279] ? acpi_table_parse+0x86/0xbc
>> [ 0.032281] ? acpi_boot_init+0x79/0xad
>> [ 0.032282] ? setup_arch+0x835/0x954
>> [ 0.032284] ? start_kernel+0x5d/0x455
>> [ 0.032286] ? secondary_startup_64_no_verify+0xc2/0xcb
>>
>> efi_reserve_boot_services() has reserved memory of type
>> EFI_BOOT_SERVICES_CODE & EFI_BOOT_SERVICES_DATA before crashkernel.
>> efi_bgrt_init() assumes that EFI_BOOT_SERVICES_DATA is not reserved by
>> other modules. Then, the e820_table is directly updated, and the BGRT
>> memory is reserved.
>>
>> However, memblock_is_region_reserved() in efi_reserve_boot_services()
>> returns true when the ranges only overlap.
>>
>> already_reserved = memblock_is_region_reserved(start, size);
>
> Do you mean efi_reserve_boot_services is supposed to reserve the bgrt
> memory but it does not reserve it due to the region overlapping with
> some other reserved region? If so can you debug and find what exact
> memblock reserved region overlaps with the bgrt?
Yes. I added the following debug print to efi_reserve_boot_services():
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -339,6 +339,10 @@ void __init efi_reserve_boot_services(void)
already_reserved = memblock_is_region_reserved(start, size);
+ pr_info("kdumpdebug: efi_reserve_boot_services start 0x%llx, "
+ "size 0x%llx, type 0x%x, already_reserved %d\n",
+ start, size, md->type, already_reserved);
+
/*
* Because the following memblock_reserve() is paired
* with memblock_free_late() for this region in
The memory [0x000000005976a018-0x000000005976abc7], which belongs to EFI_BOOT_SERVICES_DATA, is reserved earlier by efi_memattr_init():
[ 0.000000] memblock_reserve: [0x000000005976a018-0x000000005976abc7] efi_memattr_init+0x51/0xa0
It falls within the following range:
[ 0.000000] efi: mem22: [Boot Data | | | | | | | | | | |WB|WT|WC|UC] range=[0x0000000051329000-0x000000005cefefff] (187MB)
In efi_reserve_boot_services(), [0x0000000051329000-0x000000005cefefff] will not be fully reserved, because
[0x000000005976a018-0x000000005976abc7] has already been reserved and overlaps with it:
[ 0.021316] efi: kdumpdebug: efi_reserve_boot_services start 0x51329000, size 0xbbd6000, type 0x4, already_reserved 1
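To illustrate why already_reserved is 1 here, below is a small userspace sketch (not kernel code) of the difference between "overlaps an existing reservation" and "is fully covered by one". The struct and helper names are made up for illustration; only the addresses come from the logs above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive physical address range [start, end]. Hypothetical helper type. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* True if a and b share at least one byte (the "only overlap" case). */
static bool ranges_overlap(struct range a, struct range b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* True only if inner lies entirely within outer. */
static bool range_contains(struct range outer, struct range inner)
{
	return outer.start <= inner.start && inner.end <= outer.end;
}

/* EFI "Boot Data" descriptor mem22, from the efi: line above. */
static const struct range efi_boot_data = { 0x51329000ULL, 0x5cefefffULL };

/* Range reserved earlier from efi_memattr_init(), from the memblock log. */
static const struct range memattr_resv = { 0x5976a018ULL, 0x5976abc7ULL };
```

With these values, ranges_overlap(efi_boot_data, memattr_resv) is true while range_contains(memattr_resv, efi_boot_data) is false: the small early reservation touches the descriptor but covers only a tiny part of it.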
The rest of that EFI boot data range is therefore not reserved in memblock, and the free memory in it is later allocated to the crashkernel:
[ 0.022597] crashkernel low memory reserved: 0x49000000 - 0x59000000 (256 MB)
[ 0.022599] crashkernel reserved: 0x000000c01f000000 - 0x000000c03f000000 (512 MB)
efi_bgrt_init() assumes that memory of type EFI_BOOT_SERVICES_DATA has already been successfully
reserved, so it uses the address in that range directly. As a result, the BGRT memory overlaps with
the crashkernel low region:
[ 0.029694] e820: update [mem 0x53cbd000-0x53ccffff] usable ==> reserved
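As a quick cross-check of the addresses, here is a userspace sketch (again not kernel code, helper names are hypothetical) that computes how many bytes two inclusive ranges share. The three ranges are copied from the log lines in this mail.

```c
#include <stdint.h>

/* Inclusive physical address range [start, end]. Hypothetical helper type. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Number of bytes shared by a and b, or 0 if they do not intersect. */
static uint64_t overlap_bytes(struct range a, struct range b)
{
	uint64_t lo = a.start > b.start ? a.start : b.start;
	uint64_t hi = a.end < b.end ? a.end : b.end;

	return lo > hi ? 0 : hi - lo + 1;
}

/* Range from the "e820: update ... usable ==> reserved" line. */
static const struct range bgrt_resv = { 0x53cbd000ULL, 0x53ccffffULL };

/* "crashkernel low memory reserved: 0x49000000 - 0x59000000", end made inclusive. */
static const struct range crashk_low = { 0x49000000ULL, 0x58ffffffULL };

/* EFI "Boot Data" descriptor mem22. */
static const struct range efi_boot_data = { 0x51329000ULL, 0x5cefefffULL };
```

overlap_bytes(bgrt_resv, crashk_low) gives the full size of the range from the e820 update line, i.e. that range sits entirely inside the crashkernel low region, and overlap_bytes(bgrt_resv, efi_boot_data) shows it is also entirely inside mem22.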
>
> BTW, the previous email threads are weird, and not threading
> correctly, hard to find information.
That is probably because the log content was too large and the message was put on hold. For my previous email, I received this notice:
The reason it is being held:
Message body is too big: 248998 bytes with a limit of 40 KB
>
>>
>> /*
>> * Because the following memblock_reserve() is paired
>> * with memblock_free_late() for this region in
>> * efi_free_boot_services(), we must be extremely
>> * careful not to reserve, and subsequently free,
>> * critical regions of memory (like the kernel image) or
>> * those regions that somebody else has already
>> * reserved.
>> *
>> * A good example of a critical region that must not be
>> * freed is page zero (first 4Kb of memory), which may
>> * contain boot services code/data but is marked
>> * E820_TYPE_RESERVED by trim_bios_range().
>> */
>> if (!already_reserved) {
>> memblock_reserve(start, size);
>>
>> /*
>> * If we are the first to reserve the region, no
>> * one else cares about it. We own it and can
>> * free it later.
>> */
>> if (can_free_region(start, size))
>> continue;
>> }
>>
>> As a result, some memory of EFI_BOOT_SERVICES_DATA is not reserved in
>> advance. The subsequent crashkernel happens to reserve this portion of
>> memory, which conflicts with BGRT.
>>
>>> Current analysis suggests that efi_reserve_boot_services() is causing the update of the e820 table.
>>>
>>>>
>>>> How do you boot into your new 6.8.0 kernel? Used kexec -l to jump into the 2nd
>>>> kernel, or reboot from bios/firmware boot up into 6.8.0?
>>> It's reboot from bios boot up into 6.8.0. I attempted to revert the below patch,
>>> and this time the conflicting segment "53cbd000-53ccffff" also appeared in the /proc/iomem
>>> of the 6.8 kernel.
>>>
>>> 2d4fd058-60efefff : System RAM
>>> 2d4fd058-58ffffff : System RAM
>>> 49000000-58ffffff : Crash kernel
>>> 53cbd000-53ccffff : Reserved
>>> 60eff000-704fefff : Reserved
>>> --
>>> 93dd424000-93dd9fffff : Kernel bss
>>> c01f000000-c03effffff : Crash kernel
>>> d0000000000-d0fffffffff : PCI Bus 0000:00
>>> d0000000000-d00001fffff : PCI Bus 0000:01
>>>>
>>>> Reverting below commit should fix your problem, can you try it?
>>>>
>>>> commit 4a693ce65b186fddc1a73621bd6f941e6e3eca21
>>>> Author: Huacai Chen <chenhuacai at kernel.org>
>>>> Date: Fri Dec 29 16:02:13 2023 +0800
>>>>
>>>> kdump: defer the insertion of crashkernel resources
>>>
>>> .
>>>
>>
>> _______________________________________________
>> kexec mailing list
>> kexec at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
>
> .
>