[PATCH v2 0/5] kdump: crashkernel reservation from CMA
Donald Dutile
ddutile at redhat.com
Mon Mar 3 06:17:51 PST 2025
On 3/3/25 3:25 AM, David Hildenbrand wrote:
> On 20.02.25 17:48, Jiri Bohac wrote:
>> Hi,
>>
>> this series implements a way to reserve additional crash kernel
>> memory using CMA.
>>
>> Link to the v1 discussion:
>> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
>> See below for the changes since v1 and how concerns from the
>> discussion have been addressed.
>>
>> Currently, none of the memory reserved for the crash kernel is
>> usable by the 1st (production) kernel. It is also unmapped so
>> that it can't
>> be corrupted by the fault that will eventually trigger the crash.
>> This makes sense for the memory actually used by the kexec-loaded
>> crash kernel image and initrd and the data prepared during the
>> load (vmcoreinfo, ...). However, the reserved space needs to be
>> much larger than that to provide enough run-time memory for the
>> crash kernel and the kdump userspace. Estimating the amount of
>> memory to reserve is difficult. Reserving too little makes kdump
>> likely to end in OOM; reserving too much takes even more memory
>> from the production system. Also, the reservation only allows
>> reserving a single contiguous block (or two with the "low"
>> suffix). I've seen systems where this fails because the physical
>> memory is fragmented.
>>
>> By reserving additional crashkernel memory from CMA, the main
>> crashkernel reservation can be just large enough to fit the
>> kernel and initrd image, minimizing the memory taken away from
>> the production system. Most of the run-time memory for the crash
>> kernel will be memory previously available to userspace in the
>> production system. As this memory is no longer wasted, the
>> reservation can be done with a generous margin, making kdump more
>> reliable. Kernel memory that we need to preserve for dumping is
>> never allocated from CMA. User data is typically not dumped by
>> makedumpfile. When dumping of user data is intended, this new CMA
>> reservation cannot be used.
>
>
> Hi,
>
> I'll note that your comment about "user space" is currently the case, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation will end up in there.
>
> Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
> * memory balloon: pages we never want to dump either way
> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>
> Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
>
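To make that point concrete: it is not only anonymous user pages; any
allocation made with __GFP_MOVABLE may be satisfied from (or fall back
to) MIGRATE_CMA pageblocks. A trivial illustration, not taken from the
patch set:

  /*
   * Illustration only (needs linux/gfp.h): a movable allocation made
   * from kernel context.  The page allocator may serve __GFP_MOVABLE
   * requests from MIGRATE_CMA pageblocks, so nothing here guarantees
   * the page holds user data that makedumpfile would skip.
   */
  struct page *page = alloc_pages(GFP_HIGHUSER_MOVABLE, 0);

  if (page) {
          /* ... the page may sit inside the crashkernel CMA area ... */
          __free_pages(page, 0);
  }
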
IOMMUFD is a near-term candidate for user page tables, with multi-stage IOMMU support going through upstream review at the moment.
Just saying that David's case will become the norm in high-end VMs with performance-enhanced, guest-driven IOMMU support (for GPUs).
> Once that happens, it will become a lot harder to judge whether CMA can be used or not. At least, the kernel could bail out/warn for these kernel configs.
>
I don't think the aforementioned work is focused on using CMA, but given CMA's performance benefits, it won't take long before that becomes the next performance-improvement step.
>>
>> There are five patches in this series:
>>
>> The first adds a new ",cma" suffix to the recently introduced generic
>> crashkernel parsing code. parse_crashkernel() takes one more
>> argument to store the cma reservation size.
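As I read it, the command line would then carry something along the
lines of crashkernel=4G,cma next to the existing reservation, and an
arch's call into the generic parser would grow one extra output
pointer; purely a sketch, the exact syntax and prototype are whatever
patch 1 defines:

  /*
   * Sketch only: per the cover letter, parse_crashkernel() gains one
   * more output argument for the ",cma" part of crashkernel=.  The
   * argument's position and type here are my guess, not the patch.
   */
  unsigned long long crash_size, crash_base, low_size, cma_size = 0;
  bool high = false;
  int ret;

  ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
                          &crash_size, &crash_base,
                          &low_size, &high, &cma_size);
  if (ret)
          return;
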
>>
>> The second patch implements reserve_crashkernel_cma() which
>> performs the reservation. If the requested size is not available
>> in a single range, multiple smaller ranges will be reserved.
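My mental model of that fallback, written against today's CMA API
(cma_declare_contiguous() from linux/cma.h); the helper name, the 64M
floor and the bookkeeping are mine, not necessarily what patch 2 does:

  /*
   * Sketch, not the actual patch: declare the requested size as one
   * CMA area if possible; otherwise halve the per-area size and keep
   * declaring areas until the total is covered or the chunks become
   * too small to bother with.
   */
  static void __init reserve_crashkernel_cma_sketch(unsigned long long total)
  {
          unsigned long long chunk = total, reserved = 0;

          while (reserved < total && chunk >= SZ_64M) {
                  unsigned long long sz = min(chunk, total - reserved);
                  struct cma *cma;

                  if (cma_declare_contiguous(0, sz, 0,
                                             CMA_MIN_ALIGNMENT_BYTES,
                                             0, false, "crashkernel",
                                             &cma)) {
                          chunk >>= 1;  /* too fragmented, go smaller */
                          continue;
                  }
                  reserved += sz;
                  /* the real code would record each area for crash time */
          }
  }

One thing worth keeping in mind with that shape: each successful call
uses one of the CONFIG_CMA_AREAS (MAX_CMA_AREAS) slots, so heavy
fragmentation puts a practical cap on how far the splitting can go.
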
>>
>> The third patch updates Documentation/, explicitly mentioning the
>> potential DMA corruption of the CMA-reserved memory.
>>
>> The fourth patch adds a short delay before booting the kdump
>> kernel, allowing pending DMA transfers to finish.
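For that delay I picture something as simple as the following in the
crash shutdown path; the knob name, the default and the
crashk_cma_size indicator are all my invention, purely to illustrate:

  /*
   * Sketch only, names are hypothetical (needs linux/delay.h): give
   * any in-flight DMA that targets the formerly page-allocator-owned
   * CMA ranges a chance to drain before the kdump kernel starts
   * writing into them.
   */
  static unsigned int cma_dma_settle_ms = 100;  /* made-up default */

  static void crash_cma_settle(void)
  {
          if (crashk_cma_size)  /* assumed "CMA was reserved" flag */
                  mdelay(cma_dma_settle_ms);
  }

Whatever the mechanism, the open question is of course how long is
long enough.
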
>
>
> What does "short" mean? At least in theory, long-term pinning is forbidden for MIGRATE_CMA, so we should not have such pages mapped into an iommu where DMA can happily keep going on for quite a while.
>
Hmmm, in the case I mentioned above, should there be a kexec hook in the multi-stage IOMMU support for the hypervisor/VMM to invalidate/shut off stage-2 mappings ASAP (a multi-microsecond process) so that
DMA from VMs is cut off quickly? Is that already done today (due to 'simple', single-stage device assignment in a VM)?
> But that assumes that our old kernel is not buggy, and doesn't end up mapping these pages into an IOMMU where DMA will just continue. I recall that DRM might currently be a problem, described here [1].
>
> If kdump starts not working as expected in case our old kernel is buggy, doesn't that partially destroy the purpose of kdump (-> debug bugs in the old kernel)?
>
>
> [1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
>