Question about Address Range Validation in Crash Kernel Allocation

Li Huafei lihuafei1 at huawei.com
Thu Mar 21 02:48:46 PDT 2024


Hi Baoquan,

On 2024/3/21 17:17, chenhaixiang (A) wrote:
> 
>>> I'm sorry for the delay. Here are some details from the boot log and /proc/iomem:
>>> The Boot log:
>>> [    0.000000] Linux version 6.8.0 (root at localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #3 SMP PREEMPT_DYNAMIC Wed Mar 20 11:46:11 UTC 2024
>>> [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0 root=/dev/mapper/root ro crashkernel=512M resume=/dev/mapper/swap rd.lvm.lv=root rd.lvm.lv=swap crash_kexec_post_notifiers softlockup_panic=1 reserve_kbox_mem=16M fsck.mode=auto fsck.repair=yes panic=3 nmi_watchdog=1 quiet rd.shell=0 memblock=debug efi=debug console=ttyS0,115200n8 console=tty0
>> ......snip...
>>> [    0.022622] memblock_phys_alloc_range: 536870912 bytes align=0x1000000 from=0x0000000000000000 max_addr=0x0000000100000000 reserve_crashkernel_generic+0x7c/0x220
>>> [    0.022628] memblock_phys_alloc_range: 536870912 bytes align=0x1000000 from=0x0000000100000000 max_addr=0x0000400000000000 reserve_crashkernel_generic+0x7c/0x220
>>> [    0.022632] memblock_reserve: [0x000000c01f000000-0x000000c03effffff] memblock_alloc_range_nid+0xee/0x170
>>> [    0.022634] memblock_phys_alloc_range: 268435456 bytes align=0x1000000 from=0x0000000000000000 max_addr=0x0000000100000000 reserve_crashkernel_generic+0x11d/0x220
>>> [    0.022638] memblock_reserve: [0x0000000049000000-0x0000000058ffffff] memblock_alloc_range_nid+0xee/0x170
>>> [    0.022640] crashkernel low memory reserved: 0x49000000 - 0x59000000 (256 MB)
>>> [    0.022641] crashkernel reserved: 0x000000c01f000000 - 0x000000c03f000000 (512 MB)
>>
>> Here, crashkernel,low is reserved in region:  [0x49000000 - 0x59000000] (256
>> MB)
>>       crashkernel,high is reserved in region: [0x000000c01f000000 -
>> 0x000000c03f000000] (512 MB) ......
>>> [    0.029839] memblock_reserve: [0x000000c03ffff740-0x000000c03fffff7f] memblock_alloc_range_nid+0xee/0x170
>>> [    0.029843] e820: update [mem 0x53cbd000-0x53ccffff] usable ==> reserved
>>> [    0.029861] TSC deadline timer available
>>
>> Then here, region [0x53cbd000-0x53ccffff] is reserved in e820, which prints the above
>> "usable ==> reserved". This should be the step which prevents earlier reserved
>> crashkernel,low from being added to iomem tree. I am not sure what triggered
>> the e820 update.

We added a dump_stack() call to efi_mem_reserve() and found that
[0x53cbd000-0x53ccffff] was reserved by BGRT:

  [    0.032259] e820: update [mem 0x53cbd000-0x53ccffff] usable ==> reserved
  [    0.032262] CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0-60.18.0.50.h820.eulerosv2r11.x86_64 #7
  [    0.032263] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 8.25 08/30/2022
  [    0.032264] Call Trace:
  [    0.032265]  ? dump_stack+0x57/0x6e
  [    0.032267]  ? bgrt_init+0xc2/0xc2
  [    0.032268]  ? __e820__range_update+0x7a/0x1d6
  [    0.032270]  ? bgrt_init+0xc2/0xc2
  [    0.032272]  ? bgrt_init+0xc2/0xc2
  [    0.032274]  ? efi_arch_mem_reserve+0x1a3/0x1d0
  [    0.032276]  ? efi_mem_reserve+0x2d/0x42
  [    0.032278]  ? acpi_parse_bgrt+0xa/0x11
  [    0.032279]  ? acpi_table_parse+0x86/0xbc
  [    0.032281]  ? acpi_boot_init+0x79/0xad
  [    0.032282]  ? setup_arch+0x835/0x954
  [    0.032284]  ? start_kernel+0x5d/0x455
  [    0.032286]  ? secondary_startup_64_no_verify+0xc2/0xcb

efi_reserve_boot_services() reserves memory of type
EFI_BOOT_SERVICES_CODE and EFI_BOOT_SERVICES_DATA before the crashkernel
reservation is made. efi_bgrt_init() assumes that EFI_BOOT_SERVICES_DATA
memory is not reserved by other modules, so the e820 table is updated
directly and the BGRT memory is reserved.

However, memblock_is_region_reserved() in efi_reserve_boot_services()
returns true even when the range only partially overlaps an existing
reservation, so the whole region is then treated as already reserved:

     already_reserved = memblock_is_region_reserved(start, size);

     /*
      * Because the following memblock_reserve() is paired
      * with memblock_free_late() for this region in
      * efi_free_boot_services(), we must be extremely
      * careful not to reserve, and subsequently free,
      * critical regions of memory (like the kernel image) or
      * those regions that somebody else has already
      * reserved.
      *
      * A good example of a critical region that must not be
      * freed is page zero (first 4Kb of memory), which may
      * contain boot services code/data but is marked
      * E820_TYPE_RESERVED by trim_bios_range().
      */
     if (!already_reserved) {
             memblock_reserve(start, size);

             /*
              * If we are the first to reserve the region, no
              * one else cares about it. We own it and can
              * free it later.
              */
             if (can_free_region(start, size))
                     continue;
     }

As a result, part of the EFI_BOOT_SERVICES_DATA memory is not reserved in
advance. The subsequent crashkernel reservation happens to take this
portion of memory, which conflicts with BGRT.

> Current analysis suggests that efi_reserve_boot_services() is causing the update of the e820 table.
> 
>>
>> How do you boot into your new 6.8.0 kernel? Used kexec -l to jump into the 2nd
>> kernel, or reboot from bios/firmware boot up into 6.8.0?
> It's a reboot from BIOS into 6.8.0. I tried reverting the patch below,
>  and this time the conflicting segment "53cbd000-53ccffff" also appeared in the /proc/iomem
>  of the 6.8 kernel.
> 
> 2d4fd058-60efefff : System RAM
>   2d4fd058-58ffffff : System RAM
>     49000000-58ffffff : Crash kernel
>       53cbd000-53ccffff : Reserved
> 60eff000-704fefff : Reserved
> --
>   93dd424000-93dd9fffff : Kernel bss
>   c01f000000-c03effffff : Crash kernel
> d0000000000-d0fffffffff : PCI Bus 0000:00
>   d0000000000-d00001fffff : PCI Bus 0000:01
>>
>> Reverting below commit should fix your problem, can you try it?
>>
>> commit 4a693ce65b186fddc1a73621bd6f941e6e3eca21
>> Author: Huacai Chen <chenhuacai at kernel.org>
>> Date:   Fri Dec 29 16:02:13 2023 +0800
>>
>>     kdump: defer the insertion of crashkernel resources
> 
> .
> 


