[PATCH makedumpfile] Fix incorrect PFN exclusion when LOAD segments overlap

HAGIO KAZUHITO(萩尾 一仁) k-hagio-ab at nec.com
Tue Oct 8 18:47:40 PDT 2024


Hi Ming,

thank you for the detailed description.

On 2024/10/08 17:43, Ming Wang wrote:

>> On 2024/08/14 11:43, Ming Wang wrote:
>>> When iterating through LOAD segments to exclude pages in
>>> `exclude_nodata_pages`, simply using `phys_end` to track allocated
>>> PFNs can lead to the erroneous exclusion of valid pages from a
>>> subsequent LOAD segment if its `phys_start` overlaps with the
>>> previous segment's address range.
>>>
>>> This patch addresses the issue by checking for such overlaps with the
>>> next LOAD segment's physical address range. If an overlap is
>>> detected, we continue including PFNs from the current segment until
>>> reaching the end of the overlapping region.
>>>
>>> This fix ensures that all valid pages within overlapping LOAD segments
>>> are correctly included during PFN exclusion.
>>
>> I have a few questions:
>>
>> - I heard from Masa that the overlap is caused by a bug, isn't it fixed
>> on the kernel side?  or it will be fixed but is this patch for existing
>> kernels that have the bug?
> The overlap you mentioned is not caused by a kernel bug and is not expected to be fixed in the kernel.

I see.

>>
>> - exclude_nodata_pages() is for dumpfiles created by "makedumpfile -E",
>> which uses p_filesz in program headers to record existing pages after
>> exclusion.  The function excludes pages only between
>> phys_start+file_size and phys_end, so if p_filesz == p_memsz, it will
>> not exclude any pages.
> My system's page size is 16K. Consider the following LOAD segment information:
> 
> phys_start         phys_end       virt_start         virt_end
> LOAD[ 0]           200000          2186200 9000000000200000 9000000002186200
> LOAD[ 1]           200000          ee00000 9000000000200000 900000000ee00000
> 
> I added some print statements within the exclude_nodata_pages function:
> 
> while (pfn < pfn_end) {
>      ERRMSG("\nfunc:%s: pfn: %#llx phys_start: %#llx phys_end: %#llx file_size: %lu pfn_end:%#llx\n"
>          , __func__, pfn, phys_start, phys_end, (unsigned long)file_size, pfn_end);
>      clear_bit_on_2nd_bitmap(pfn, cycle);
>      ++pfn;
> }
> 
> The following output was observed:
> 
> func:exclude_nodata_pages: pfn: 0x861 phys_start: 0x200000 phys_end: 0x2186200 file_size: 33055232 pfn_end:0x862
> 
> After this print statement, clear_bit_on_2nd_bitmap is called, effectively clearing the bitmap bit for pfn: 0x861.
> However, pfn: 0x861 is a valid PFN that happens to store the kernel's init_uts_ns. This leads to an error when the
> crash tool attempts to detect the kernel version:
> 
> read_diskdump: PAGE_EXCLUDED: paddr/pfn: 2185d98/861
> crash: page excluded: kernel virtual address: 9000000002185d98  type: "init_uts_ns"

So the calculated pfn:0x861 is wrong?

> LOAD           0x000000000000c000 0x9000000000200000 0x0000000000200000
>                 0x0000000002550400 0x0000000002550400  RWE    0x0

This is a different example; p_memsz equals p_filesz as expected, so no pages should be excluded.

But the sizes are not aligned to the 16K page size, and

   pfn = paddr_to_pfn(phys_start + file_size);

looks like it rounds down into the last valid page, wrongly excluding it..?  so

-               pfn = paddr_to_pfn(phys_start + file_size);
+               pfn = paddr_to_pfn(roundup(phys_start + file_size, PAGESIZE()));
                 pfn_end = paddr_to_pfn(roundup(phys_end, PAGESIZE()));

How about this instead of your patch?

Thanks,
Kazu


>> I would like to check how they are on your machine, so could I have a
>> "readelf -l /proc/vmcore" output?
> The output of the command readelf -l /proc/vmcore executed on my system is provided below.
> 
> Note that this output is from a different kernel version than the one used in the preceding example:
> 
> [root at localhost ~]# readelf -l /proc/vmcore
> 
> Elf file type is CORE (Core file)
> Entry point 0x0
> There are 8 program headers, starting at offset 64
> 
> Program Headers:
> Type           Offset             VirtAddr           PhysAddr
>                 FileSiz            MemSiz              Flags  Align
> NOTE           0x0000000000004000 0x0000000000000000 0x0000000000000000
>                 0x0000000000004a54 0x0000000000004a54         0x4
> LOAD           0x000000000000c000 0x9000000000200000 0x0000000000200000
>                 0x0000000002550400 0x0000000002550400  RWE    0x0
> LOAD           0x0000000002560000 0x9000000000200000 0x0000000000200000
>                 0x000000000ec00000 0x000000000ec00000  RWE    0x0
> LOAD           0x0000000011160000 0x9000000090400000 0x0000000090400000
>                 0x0000000011200000 0x0000000011200000  RWE    0x0
> LOAD           0x0000000022360000 0x90000000f8e00000 0x00000000f8e00000
>                 0x00000000000c0000 0x00000000000c0000  RWE    0x0
> LOAD           0x0000000022420000 0x90000000f8ed0000 0x00000000f8ed0000
>                 0x0000000004c70000 0x0000000004c70000  RWE    0x0
> LOAD           0x0000000027090000 0x90000000fe100000 0x00000000fe100000
>                 0x0000000f81f00000 0x0000000f81f00000  RWE    0x0
> LOAD           0x0000000fa8f90000 0x9000100080000000 0x0000100080000000
>                 0x0000001000000000 0x0000001000000000  RWE    0x0
>>
>> - Do you mean that the bug always puts the two LOAD segments that have
>> the same phys_start in a row?
> Yes, as demonstrated in the example above.
> 
> Thanks,
> Ming

