[PATCH v3 0/6] crashdump: Kernel handling of CPU and memory hot un/plug
Eric DeVolder
eric.devolder at oracle.com
Wed Oct 4 11:23:47 PDT 2023
On 10/4/23 07:08, Simon Horman wrote:
> On Wed, Sep 27, 2023 at 02:11:30PM -0400, Eric DeVolder wrote:
>> When the kdump service is loaded, if a CPU or memory is hot
>> un/plugged, the crash elfcorehdr, which describes the CPUs and memory
>> in the system, must also be updated, else the resulting vmcore is
>> inaccurate (e.g. missing either CPU context or memory regions).
>>
>> The current solution utilizes udev (e.g. RHEL /usr/lib/udev/rules.d/
>> 98-kexec.rules) to initiate an unload-then-reload of the *entire* kdump
>> image (e.g. kernel, initrd, boot_params, purgatory and elfcorehdr) by
>> the userspace kexec utility. This occurs just so the elfcorehdr can
>> be updated with the latest list of CPUs and memory regions. In a
>> previous post I have outlined the significant performance problems
>> related to offloading this activity to userspace.
>>
>> With the Linux kernel 6.6 commit below, the kernel now has the ability
>> to directly modify the elfcorehdr, eliminating the need to
>> unload-then-reload the entire kdump image when CPU or memory is hot
>> un/plugged or on/offlined.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d68b4b6f307d155475cce541f2aee938032ed22e
>>
>> This kexec-tools patch series is for supporting hotplug with the
>> kexec_load() syscall; the kernel directly supports hotplug for the
>> kexec_file_load() syscall, requiring no userspace help.
>>
>> There are two basic obstacles/requirements for the kexec-tools to
>> overcome in order to support kernel hotplug rewriting of the
>> elfcorehdr.
>>
>> First, the buffer containing the elfcorehdr must be excluded from the
>> purgatory checksum/digest, which is computed at load time. Otherwise
>> kernel run-time changes to the elfcorehdr, as a result of hot un/plug,
>> would result in the checksum failing (specifically in purgatory at
>> panic kernel boot time), and the kdump capture kernel failing to start.
>> To let the kernel know it is okay to modify the elfcorehdr, kexec
>> sets the KEXEC_UPDATE_ELFCOREHDR flag.
>>
>> NOTE: The kernel specifically does *NOT* attempt to recompute the
>> checksum/digest, as that would ultimately require patching the
>> in-memory purgatory image with the updated checksum. As that purgatory
>> image is already fully linked, it is a binary blob containing no ELF
>> information which would allow it to be re-linked or patched. Thus
>> excluding the elfcorehdr from the checksum/digest avoids all these
>> problems.
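
To illustrate the point above, here is a minimal sketch (not the kexec-tools
code) of why leaving one buffer out of a load-time digest keeps that digest
valid after the buffer is rewritten. The segment names and contents are
hypothetical stand-ins, and SHA-256 via Python's hashlib is assumed purely
for illustration.

```python
import hashlib


def digest_segments(segments, exclude=()):
    """Hash each segment's bytes in order, skipping names listed in exclude."""
    h = hashlib.sha256()
    for name, data in segments:
        if name in exclude:
            continue
        h.update(data)
    return h.hexdigest()


# Hypothetical stand-ins for the loaded kdump segments.
segments = [
    ("kernel", b"\x01" * 64),
    ("initrd", b"\x02" * 64),
    ("elfcorehdr", b"\x03" * 64),
]

# Digests as they would be computed once, at load time.
load_time_all = digest_segments(segments)
load_time_excl = digest_segments(segments, exclude={"elfcorehdr"})

# Simulate a run-time rewrite of the elfcorehdr after a hot un/plug event.
segments[2] = ("elfcorehdr", b"\x04" * 64)

# A digest covering the elfcorehdr no longer matches its load-time value...
all_changed = digest_segments(segments) != load_time_all
# ...while a digest that excludes it is unaffected by the rewrite.
excl_stable = digest_segments(segments, exclude={"elfcorehdr"}) == load_time_excl
```

With the buffer excluded, the check done at panic time compares only data
that never changes after load, so nothing needs to be recomputed or patched.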
>>
>> Second, the size of the elfcorehdr buffer must be large enough
>> to accommodate growth of the number of CPUs and/or memory regions.
>>
>> To satisfy the first requirement, this patch series introduces the
>> --hotplug option to indicate to kexec-tools that kexec should exclude
>> the elfcorehdr buffer from the purgatory checksum/digest calculation
>> and set the KEXEC_UPDATE_ELFCOREHDR flag.
>>
>> To satisfy the second requirement, the size is obtained from the
>> /sys/kernel/crash_elfcorehdr_size node (new with the kernel series
>> cited above).
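
As a rough sketch of the sizing step described above, the snippet below reads
a byte count from the sysfs node named in the cover letter. The helper name,
the assumption that the node holds a single integer followed by a newline,
and the None fallback for kernels without the node are all illustrative
assumptions, not the kexec-tools implementation.

```python
import os

# Path taken from the cover letter; everything else here is a hypothetical helper.
ELFCOREHDR_SIZE_NODE = "/sys/kernel/crash_elfcorehdr_size"


def read_elfcorehdr_size(path=ELFCOREHDR_SIZE_NODE):
    """Return the size in bytes reported by the node, or None if unavailable."""
    if not os.path.exists(path):
        # Older kernels will not expose the node at all.
        return None
    with open(path, "r", encoding="ascii") as f:
        text = f.read().strip()
    try:
        # Base 0 accepts plain decimal as well as 0x-prefixed hex.
        return int(text, 0)
    except ValueError:
        return None
```

A caller could fall back to sizing the buffer from the current CPU and memory
layout whenever the helper returns None.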
>>
>> To use this feature with the kexec_load() syscall, invoke kexec with:
>>
>> kexec -c --hotplug ...
>>
>> Thanks!
>> eric
>
> Thanks Eric,
>
> applied.
Excellent, thank you!
eric