[PATCH v21 5/7] x86/crash: add x86 crash hotplug support

Mon May 1 11:33:47 PDT 2023

On 4/28/23 13:31, Hari Bathini wrote:
> 
> On 28/04/23 2:55 pm, Baoquan He wrote:
>> On 04/27/23 at 10:26pm, Hari Bathini wrote:
>>> On 27/04/23 2:19 pm, Baoquan He wrote:
>>>> On 04/27/23 at 12:39pm, Hari Bathini wrote:
>>>>> Hi Eric,
>>>>>
>>>>> On 04/04/23 11:33 pm, Eric DeVolder wrote:
>>>>>> When CPU or memory is hot un/plugged, or off/onlined, the crash
>>>>>> elfcorehdr, which describes the CPUs and memory in the system,
>>>>>> must also be updated.
>>>>>>
>>>>>> The segment containing the elfcorehdr is identified at run-time
>>>>>> in crash_core:crash_handle_hotplug_event(), which works for both
>>>>>> the kexec_load() and kexec_file_load() syscalls. A new elfcorehdr
>>>>>> is generated from the available CPUs and memory into a buffer,
>>>>>> and then installed over the top of the existing elfcorehdr.
>>>>>>
>>>>>> In the patch 'kexec: exclude elfcorehdr from the segment digest'
>>>>>> the need to update purgatory due to the change in elfcorehdr was
>>>>>> eliminated.  As a result, no changes to purgatory or boot_params
>>>>>> (as the elfcorehdr= kernel command line parameter pointer
>>>>>> remains unchanged and correct) are needed, just elfcorehdr.
>>>>>>
>>>>>> To accommodate a growing number of resources via hotplug, the
>>>>>> elfcorehdr segment must be sufficiently large enough to accommodate
>>>>>> changes, see the CRASH_MAX_MEMORY_RANGES description. This is used
>>>>>> only on the kexec_file_load() syscall; for kexec_load() userspace
>>>>>> will need to size the segment similarly.
>>>>>>
>>>>>> To accommodate kexec_load() syscall in the absence of
>>>>>
>>>>> Firstly, thanks! This series is a nice improvement to kdump support
>>>>> in hotplug environment.
Thank you!

>>>>>
>>>>> One concern though is that this change assumes corresponding support
>>>>> in kexec-tools. Without that support kexec_load would fail to boot
>>>>> with digest verification failure, iiuc.

Yes, you've identified the problem correctly: if a hotplug change occurs after a
kexec_load made with kexec-tools that has not been updated for hotplug, then a
subsequent panic would fail the purgatory digest verification, and kdump would
not happen.

>>>>
>>>> Eric has posted patchset to modify kexec_tools to support that, please
>>>> see the link Eric pasted in the cover letter.
>>>>
>>>> http://lists.infradead.org/pipermail/kexec/2022-October/026032.html
>>>
>>> Right, Baoquan.
>>>
>>> I did see that and if I read the code correctly, without that patchset
>>> kexec_load would fail. Not with an explicit error that hotplug support
>>> is missing or such but it would simply fail to boot into capture kernel
>>> with digest verification failure.
This is correct.

>>>
>>> My suggestion was to avoid that userspace tool breakage for older
>>> kexec-tools version by introducing a new kexec flag that can tell
>>> kernel that kexec-tools is ready to use this in-kernel update support.
>>> So, if kexec_load happens without the flag, avoid doing an in-kernel
>>> update on hotplug. I hope that clears the confusion.
>>
>> Yeah, sounds like a good idea. It may be extended in later patch.
> 
> Fixing it in this series itself would be a cleaner way, I guess.

Your suggestion of using a flag makes a lot of sense; it is an indication
to the kernel that it is valid/okay to modify the kexec_load elfcorehdr.
Only a kexec-tools that understands this (meaning the elfcorehdr buffer is
appropriately sized *and* the elfcorehdr is excluded from the purgatory check)
would set that flag.

The roll-out of this feature needs to be coordinated, no doubt. There are three
pieces to this puzzle: this kernel series, the udev rule changes, and the changes
to kexec-tools for kexec_load.

I consider the udev rule changes critical to making this feature work efficiently.
I also think that deploying the udev rules immediately is doable since nothing
references them yet; they would be NOPs. And they would be in place when the
kernel and/or kexec-tools changes deploy.

However, your point about supporting kexec_load with and without this new flag
means the sysfs nodes upon which the udev rule changes rely need to be a bit
smarter now. (I'm assuming these udev rules will be generally accepted as-is,
as they are simple and efficient.)

The sysfs crash_hotplug nodes need to take into account kexec_file_load vs
(kexec_load && new_flag). Generally speaking, we want these crash_hotplug sysfs
nodes to read 1 going forward, but where kexec_load is used by an older kexec-tools
and/or without new_flag, they need to read 0. In this way the udev rules can remain
as proposed and work properly for kexec_file_load and both flavors of kexec_load.
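
As a rough sketch of the decision described above (not the code that will be in
v22), the value reported by the crash_hotplug nodes could be computed along these
lines. The flag name and bit value below are placeholders for "new_flag", and the
function name is hypothetical; only the kexec_file_load vs (kexec_load && new_flag)
logic comes from this discussion.

```c
#include <stdbool.h>

/*
 * Placeholder for the proposed "new_flag". The name and the bit value
 * are assumptions made for illustration only.
 */
#define HYPOTHETICAL_KEXEC_NEW_FLAG 0x00000004UL

/*
 * Value a crash_hotplug sysfs node would report (hypothetical helper).
 *
 * file_mode:   true if the crash image was loaded via kexec_file_load()
 * kexec_flags: flags userspace passed to kexec_load(); ignored when
 *              file_mode is true
 *
 * Returns 1 when the kernel may update the elfcorehdr in place on a
 * hotplug event, so the udev rules can skip the unload-then-reload.
 * Returns 0 when userspace must reload the crash kernel itself.
 */
int crash_hotplug_value(bool file_mode, unsigned long kexec_flags)
{
	/* kexec_file_load: the kernel built the elfcorehdr segment itself. */
	if (file_mode)
		return 1;

	/* kexec_load: only if kexec-tools opted in via the new flag. */
	if (kexec_flags & HYPOTHETICAL_KEXEC_NEW_FLAG)
		return 1;

	/* Older kexec-tools: an in-kernel update would break the digest. */
	return 0;
}
```

With this shape, an older kexec-tools that never sets the flag sees 0 and keeps
its existing reload behavior, while the udev rules themselves do not change.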

Good catch! I'll post v22 soon.

Eric

> 
> Thanks
> Hari