[PATCH 0/1] Fix for riscv vmcore issue

Nick Kossifidis mick at ics.forth.gr
Wed Jul 16 11:16:09 PDT 2025


On 7/16/25 14:47, Alexandre Ghiti wrote:
> 
> 
> I'm still not in favor of this solution; that does not sound right.
> 
> I think we should not exclude the "Reserved" regions which lie inside 
> the "System RAM" region. The problem is that we mark "nomap" regions 
> as "Reserved" too; I would say this is where we are wrong: "nomap" 
> regions don't even have a direct mapping, so they should be presented 
> as a hole rather than "Reserved". And that would allow us to not 
> exclude the "Reserved" regions.
> 
> @Simon, @Pnina WDYT?
> 
> Thanks,
> 
> Alex
> 
> 
NOMAP means the region is reserved:

https://elixir.bootlin.com/linux/v6.16-rc6/source/include/linux/memblock.h#L36

* @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
* reserved in the memory map; refer to memblock_mark_nomap() description
* for further details

https://elixir.bootlin.com/linux/v6.16-rc6/source/mm/memblock.c#L1060

* The memory regions marked with %MEMBLOCK_NOMAP will not be added to the
* direct mapping of the physical memory. These regions will still be
* covered by the memory map. The struct page representing NOMAP memory
* frames in the memory map will be PageReserved()
*
* Note: if the memory being marked %MEMBLOCK_NOMAP was allocated from
* memblock, the caller must inform kmemleak to ignore that memory

This is also what ARM64 does btw:

https://elixir.bootlin.com/linux/v6.16-rc6/source/arch/arm64/kernel/setup.c#L230
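
To make the distinction concrete, here is a minimal sketch (loosely
modeled on the generic DT reserved-memory handling, not the actual
RISC-V code; the helper names are made up, the memblock and page-flag
calls are the real ones):

#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/*
 * How a firmware-reserved region is typically handled at early boot:
 * with "no-map" it is dropped from the kernel's direct mapping, without
 * it the region is only kept away from the allocator. Either way its
 * frames stay covered by the memory map.
 */
static int __init reserve_region(phys_addr_t base, phys_addr_t size,
                                 bool nomap)
{
        if (nomap)
                return memblock_mark_nomap(base, size);
        return memblock_reserve(base, size);
}

/*
 * Later, once the memory map is up, a NOMAP frame still has a
 * struct page, and that page is PageReserved(), as the comments
 * quoted above say.
 */
static void check_nomap_page(phys_addr_t base)
{
        WARN_ON(!PageReserved(pfn_to_page(PHYS_PFN(base))));
}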


Sorry I didn't review this earlier...

* In the original kexec-tools port I added a function,
dtb_get_memory_ranges, that parsed the device tree (either through
/sys/firmware/fdt or a user-provided one) for memory regions, including
reserved memory regions added there e.g. by OpenSBI. This works better
than using /proc/iomem, since /proc/iomem captures the memory layout of
the running system, not the system we are going to boot into, and
/proc/iomem is not a standardized interface, which was yet another
reason I wanted to avoid it. I could argue further why that approach is
better, but it's a bit off topic here, and since we now have EFI in
play it needed review anyway (when I wrote it we didn't have ACPI/EFI
support, so things were nice and clean). The thing is, I'd prefer if
there was still an option to use it: the function was upstreamed but it
is no longer called, so please fix that. Not everyone uses ACPI/EFI or
cares about it (almost all RISC-V systems I've seen in production use a
device tree). In our use cases, where we e.g. swap accelerators on
FPGAs, this approach is much simpler to follow and implement than
having to re-generate ACPI tables for the target system every time
(each new accelerator comes with its own reserved regions etc. that we
don't want to overlap with e.g. the initrd or the kernel image). Also
keep in mind that kexec is not there just for kdump; it's a very useful
feature for other use cases as well.
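
For reference, here is a rough userspace sketch of that idea (not the
actual dtb_get_memory_ranges code; it assumes #address-cells = 2 and
#size-cells = 2 as is typical on RV64, a DTB no larger than 1 MiB, and
it ignores the /memreserve/ block):

#include <stdio.h>
#include <string.h>
#include <libfdt.h>

static void print_regs(const void *fdt, int node, const char *tag)
{
        int len;
        const fdt64_t *reg = fdt_getprop(fdt, node, "reg", &len);

        if (!reg)
                return;
        /* Each entry is <addr size>, one 64-bit value each under the
         * 2-cell assumption; fdt64_ld() handles unaligned loads. */
        for (int i = 0; i + 1 < len / (int)sizeof(*reg); i += 2)
                printf("%s: 0x%016llx + 0x%llx\n", tag,
                       (unsigned long long)fdt64_ld(&reg[i]),
                       (unsigned long long)fdt64_ld(&reg[i + 1]));
}

int main(void)
{
        static char fdt[1 << 20];
        FILE *f = fopen("/sys/firmware/fdt", "rb");
        int node;

        if (!f || !fread(fdt, 1, sizeof(fdt), f) || fdt_check_header(fdt))
                return 1;
        fclose(f);

        /* System RAM as described by firmware ("memory@..." nodes). */
        fdt_for_each_subnode(node, fdt, 0) {
                const char *name = fdt_get_name(fdt, node, NULL);

                if (name && !strncmp(name, "memory", 6))
                        print_regs(fdt, node, "memory");
        }

        /* Regions reserved e.g. by OpenSBI under /reserved-memory. */
        node = fdt_path_offset(fdt, "/reserved-memory");
        if (node >= 0) {
                int child;

                fdt_for_each_subnode(child, fdt, node)
                        print_regs(fdt, child, "reserved");
        }
        return 0;
}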


* For creating the elfcorehdr for kdump (which ends up backing
/proc/vmcore in the crash kernel), we obviously need runtime
information from the running kernel (the one we want to analyze if it
crashes), so the device tree can't provide that. The standardized
interface was supposed to be /proc/kcore, but that's also a security
mess and is considered obsolete as far as I know (back when I was
working on this I remember it was being considered for removal). That's
why, in load_elfcorehdr, I used /proc/iomem and /proc/kallsyms like
other archs do in kexec-tools: to determine the addresses of specific
symbols to populate struct crash_elf_info, and to exclude the range of
the crashkernel allocated at runtime (which I also exported via
/proc/iomem). I still used the memory regions from the device tree
(info->memory_ranges populated via dtb_get_memory_ranges), which meant
that everything was there.
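
Roughly what those two lookups boil down to (a simplified sketch, not
the actual kexec-tools code; it assumes the usual "start-end : name"
format of /proc/iomem and "address type name" format of /proc/kallsyms,
and needs root to see non-zero addresses):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Find the "Crash kernel" range the running kernel exports via /proc/iomem. */
static int crash_kernel_range(uint64_t *start, uint64_t *end)
{
        char line[256];
        unsigned long long s, e;
        FILE *f = fopen("/proc/iomem", "r");
        int ret = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (strstr(line, ": Crash kernel") &&
                    sscanf(line, "%llx-%llx", &s, &e) == 2) {
                        *start = s;
                        *end = e;
                        ret = 0;
                        break;
                }
        }
        fclose(f);
        return ret;
}

/* Resolve a kernel symbol (e.g. "_stext") via /proc/kallsyms. */
static int symbol_addr(const char *sym, uint64_t *addr)
{
        char line[256], name[128], type;
        unsigned long long a;
        FILE *f = fopen("/proc/kallsyms", "r");
        int ret = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%llx %c %127s", &a, &type, name) == 3 &&
                    !strcmp(name, sym)) {
                        *addr = a;
                        ret = 0;
                        break;
                }
        }
        fclose(f);
        return ret;
}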


* When the code changed to rely only on /proc/iomem, get_memory_ranges
was changed, but load_elfcorehdr and the regions exported via
/proc/iomem remained the same. In the device tree scenario this still
worked, since init_resources exposes basically the same regions as the
device tree. It should also work for the EFI scenario: by the time we
reach init_resources we've already called efi_init -> reserve_regions,
so both reserved and nomap regions get added there. The soft-reserved
regions we don't add in efi_init are added later on in
riscv_enable_runtime_services as "Soft Reserved" resources. I don't see
why we need an arch initcall that runs after that, operating on
resources we already added (in init_resources btw, ignoring those added
by riscv_enable_runtime_services). Why do we need to do that? Note that
init_resources does two passes, unlike arm64's approach, so it should
handle overlapping regions properly.
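
The nesting is also already visible from userspace; a trivial sketch
like the one below (assuming the current indentation of child resources
in /proc/iomem, which is not a stable ABI) can tell which "Reserved"
entries lie inside "System RAM":

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        int in_system_ram = 0;
        FILE *f = fopen("/proc/iomem", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                /* Top-level entries have no leading spaces. */
                if (line[0] != ' ')
                        in_system_ram = strstr(line, ": System RAM") != NULL;
                else if (in_system_ram && strstr(line, ": Reserved"))
                        printf("Reserved inside System RAM: %s", line);
        }
        fclose(f);
        return 0;
}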

Am I missing something?

Regards,
Nick


