[PATCH makedumpfile] Fix incorrect PFN exclusion when LOAD segments overlap

Tue Oct 8 20:32:21 PDT 2024

Hi Kazu,

Thank you for your insightful analysis and suggestion!

在 2024/10/9 09:47, HAGIO KAZUHITO(萩尾 一仁) 写道:
> Hi Ming,
> 
> thank you for the detailed description.
> 
> On 2024/10/08 17:43, Ming Wang wrote:
> 
>>> On 2024/08/14 11:43, Ming Wang wrote:
>>>> When iterating through LOAD segments to exclude pages in
>>>> `exclude_nodata_pages`, simply using `phys_end` to track allocated
>>>> PFNs can lead to the erroneous exclusion of valid pages from a
>>>> subsequent LOAD segment if its `phys_start` overlaps with the
>>>> previous segment's address range.
>>>>
>>>> This patch addresses the issue by checking for such overlaps with the
>>>> next LOAD segment's physical address range. If an overlap is
>>>> detected, we continue including PFNs from the current segment until
>>>> reaching the end of the overlapping region.
>>>>
>>>> This fix ensures that all valid pages within overlapping LOAD segments
>>>> are correctly included during PFN exclusion.
>>>
>>> I have a few questions:
>>>
>>> - I heard from Masa that the overlap is caused by a bug, isn't it fixed
>>> on the kernel side?  or it will be fixed but is this patch for existing
>>> kernels that have the bug?
>> The overlap you mentioned is not caused by a kernel bug and is not expected to be fixed in the kernel.
> 
> I see.
> 
>>>
>>> - exclude_nodata_pages() is for dumpfiles created by "makedumpfile -E",
>>> which uses p_filesz in program headers to record existing pages after
>>> exclusion.  The function excludes pages only between
>>> phys_start+file_size and phys_end, so if p_filesz == p_memsz, it will
>>> not exclude any pages.
>> My system's page size is 16K, Consider the following LOAD segment information
>>
>> phys_start         phys_end       virt_start         virt_end
>> LOAD[ 0]           200000          2186200 9000000000200000 9000000002186200
>> LOAD[ 1]           200000          ee00000 9000000000200000 900000000ee00000
>>
>> I added some print statements within the exclude_nodata_pages function:
>>
>> while (pfn < pfn_end) {
>>       ERRMSG("\nfunc:%s: pfn: %#llx phys_start: %#llx phys_end: %#llx file_size: %lu pfn_end:%#llx\n"
>>           , __func__, pfn, phys_start, phys_end, (unsigned long)file_size, pfn_end);
>>       clear_bit_on_2nd_bitmap(pfn, cycle);
>>       ++pfn;
>> }
>>
>> The following output was observed:
>>
>> func:exclude_nodata_pages: pfn: 0x861 phys_start: 0x200000 phys_end: 0x2186200 file_size: 33055232 pfn_end:0x862
>>
>> After this print statement, clear_bit_on_2nd_bitmap is called, effectively clearing the bitmap bit for pfn: 0x861.
>> However, pfn: 0x861 is a valid PFN that happens to store the kernel's init_uts_ns. This leads to an error when the
>> crash tool attempts to detect the kernel version:
>>
>> read_diskdump: PAGE_EXCLUDED: paddr/pfn: 2185d98/861
>> crash: page excluded: kernel virtual address: 9000000002185d98  type: "init_uts_ns"
> 
> So the calculated pfn:0x861 is wrong?
Yes, it's wrong.
> 
>> LOAD           0x000000000000c000 0x9000000000200000 0x0000000000200000
>>                  0x0000000002550400 0x0000000002550400  RWE    0x0
> 
> This is a different example, it's expected that p_memsz equals to p_filesz.
> 
> But, the sizes are not aligned to the page size 16k, and
> 
>     pfn = paddr_to_pfn(phys_start + file_size);
> 
> it looks like this rounds down the last valid page..?  so
> 
> -               pfn = paddr_to_pfn(phys_start + file_size);
> +               pfn = paddr_to_pfn(roundup(phys_start + file_size, PAGESIZE());
>                   pfn_end = paddr_to_pfn(roundup(phys_end, PAGESIZE()));
> 
> How about this instead of your patch?
I agree that rounding up the calculated `pfn` to the nearest page 
boundary is a better approach to ensure that the last valid page of a 
LOAD segment is not accidentally excluded.

I've implemented your suggestion:

- pfn = paddr_to_pfn(phys_start + file_size);
+ pfn = paddr_to_pfn(roundup(phys_start + file_size, PAGESIZE()));

and tested it thoroughly. It effectively resolves the issue of valid 
pages being excluded due to rounding errors.

I've updated the patch accordingly and will be send v2 soon.

Thanks again for your help!

Best regards,
Ming
> 
> Thanks,
> Kazu
> 
> 
>>> I would like to check how they are on your machine, so could I have a
>>> "readelf -l /proc/vmcore" output?
>> The output of the command readelf -l /proc/vmcore executed on my system is provided below.
>>
>> It is important to note that this output corresponds to a different kernel version compared to the one used in the preceding example:
>>
>> [root at localhost ~]# readelf -l /proc/vmcore
>>
>> Elf file type is CORE (Core file)
>> Entry point 0x0
>> There are 8 program headers, starting at offset 64
>>
>> Program Headers:
>> Type           Offset             VirtAddr           PhysAddr
>> FileSiz            MemSiz              Flags  Align
>> NOTE           0x0000000000004000 0x0000000000000000 0x0000000000000000
>>                  0x0000000000004a54 0x0000000000004a54         0x4
>> LOAD           0x000000000000c000 0x9000000000200000 0x0000000000200000
>>                  0x0000000002550400 0x0000000002550400  RWE    0x0
>> LOAD           0x0000000002560000 0x9000000000200000 0x0000000000200000
>>                  0x000000000ec00000 0x000000000ec00000  RWE    0x0
>> LOAD           0x0000000011160000 0x9000000090400000 0x0000000090400000
>>                  0x0000000011200000 0x0000000011200000  RWE    0x0
>> LOAD           0x0000000022360000 0x90000000f8e00000 0x00000000f8e00000
>>                  0x00000000000c0000 0x00000000000c0000  RWE    0x0
>> LOAD           0x0000000022420000 0x90000000f8ed0000 0x00000000f8ed0000
>>                  0x0000000004c70000 0x0000000004c70000  RWE    0x0
>> LOAD           0x0000000027090000 0x90000000fe100000 0x00000000fe100000
>>                  0x0000000f81f00000 0x0000000f81f00000  RWE    0x0
>> LOAD           0x0000000fa8f90000 0x9000100080000000 0x0000100080000000
>>                  0x0000001000000000 0x0000001000000000  RWE    0x0
>>>
>>> - Do you mean that the bug always puts the two LOAD segments that have
>>> the same phys_start in a row?
>> Yes, as demonstrated in the example above.
>>
>> Thanks,
>> Ming