[PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
Sourabh Jain
sourabhjain at linux.ibm.com
Tue Feb 28 22:22:58 PST 2023
On 01/03/23 03:20, Eric DeVolder wrote:
>
>
> On 2/27/23 00:11, Sourabh Jain wrote:
>>
>> On 25/02/23 01:46, Eric DeVolder wrote:
>>>
>>>
>>> On 2/24/23 02:34, Sourabh Jain wrote:
>>>>
>>>> On 24/02/23 02:04, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>>
>>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>>> Hello Eric,
>>>>>>>>
>>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>>> Eric!
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>>
>>>>>>>>>>> So my latest solution is to introduce two new CPUHP states,
>>>>>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining.
>>>>>>>>>>> I'm open to better names.
>>>>>>>>>>>
>>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after
>>>>>>>>>>> CPUHP_BRINGUP_CPU. My
>>>>>>>>>>> attempts at locating this state failed when inside the
>>>>>>>>>>> STARTING section, so I located
>>>>>>>>>>> this just inside the ONLINE section. The crash hotplug
>>>>>>>>>>> handler is registered on
>>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>>
>>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before
>>>>>>>>>>> CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>>> placed it at the end of the PREPARE section. This crash
>>>>>>>>>>> hotplug handler is also
>>>>>>>>>>> registered on this state as the callback for the .teardown
>>>>>>>>>>> method.
>>>>>>>>>>
>>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>>
>>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>>> {
>>>>>>>>>>         struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>>
>>>>>>>>>>         return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> and use this to query the actual state at crash time. That
>>>>>>>>>> spares all
>>>>>>>>>> those callback heuristics.
>>>>>>>>>>
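A minimal sketch of this crash-time query approach, assuming the
cpu_is_alive() helper suggested above (not an existing kernel API at
the time of this thread); emit_cpu_note() is a hypothetical stand-in
for whatever writes the per-cpu note:

#include <linux/cpumask.h>

static void crash_emit_cpu_notes(void)
{
        unsigned int cpu;

        for_each_present_cpu(cpu) {
                /* Query the actual state at crash time; no cpuhp callbacks. */
                if (!cpu_is_alive(cpu))
                        continue;
                emit_cpu_note(cpu);     /* hypothetical note writer */
        }
}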
>>>>>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr,
>>>>>>>>>>> vmcoreinfo,
>>>>>>>>>>> makedumpfile and (the consumer of it all) the userspace
>>>>>>>>>>> crash utility,
>>>>>>>>>>> in order to understand the impact of moving from
>>>>>>>>>>> for_each_present_cpu()
>>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>>
>>>>>>>>>> Is the packing actually worth the trouble? What's the actual
>>>>>>>>>> win?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> tglx
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thomas,
>>>>>>>>> I've investigated the passing of crash notes through the
>>>>>>>>> vmcore. What I've learned is that:
>>>>>>>>>
>>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do
>>>>>>>>> its job) does
>>>>>>>>> not care what the contents of cpu PT_NOTES are, but it does
>>>>>>>>> coalesce them together.
>>>>>>>>>
>>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order
>>>>>>>>> to determine its
>>>>>>>>> nr_cpus variable, which is reported in a header, but
>>>>>>>>> otherwise unused (except
>>>>>>>>> for sadump method).
>>>>>>>>>
>>>>>>>>> - the crash utility, for the purposes of determining the cpus,
>>>>>>>>> does not appear to
>>>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the
>>>>>>>>> various
>>>>>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from
>>>>>>>>> that, and also of
>>>>>>>>> course which are online. In addition, when crash does
>>>>>>>>> reference the cpu PT_NOTE,
>>>>>>>>> to get its prstatus, it does so by using a percpu technique
>>>>>>>>> directly in the vmcore
>>>>>>>>> image memory, not via the ELF structure. Said differently,
>>>>>>>>> it appears to me that
>>>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus;
>>>>>>>>> rather it obtains them
>>>>>>>>> via kernel cpumasks and the memory within the vmcore.
>>>>>>>>>
>>>>>>>>> With this understanding, I did some testing. Perhaps the most
>>>>>>>>> telling test was that I
>>>>>>>>> changed the number of cpu PT_NOTEs emitted in the
>>>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>>>> hot plugged some cpus, then also took a few offline sparsely
>>>>>>>>> via chcpu, then generated a
>>>>>>>>> vmcore. The crash utility had no problem loading the vmcore;
>>>>>>>>> it reported the proper number
>>>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE),
>>>>>>>>> and changing to a different
>>>>>>>>> cpu via 'set -c 30' produced a completely valid backtrace.
>>>>>>>>>
>>>>>>>>> My takeaway is that the crash utility does not rely upon ELF cpu
>>>>>>>>> PT_NOTEs, it obtains the
>>>>>>>>> cpu information directly from kernel data structures. Perhaps
>>>>>>>>> at one time crash relied
>>>>>>>>> upon the ELF information, but no more. (Perhaps there are
>>>>>>>>> other crash dump analyzers
>>>>>>>>> that might rely on the ELF info?)
>>>>>>>>>
>>>>>>>>> So, all this to say that I see no need to change
>>>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>>>> is no compelling reason to move away from
>>>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>>>> online/offline.
>>>>>>>>>
>>>>>>>>> Which then leaves the topic of the cpuhp state on which to
>>>>>>>>> register. Perhaps reverting
>>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer.
>>>>>>>>> There does not appear to
>>>>>>>>> be a compelling need to accurately track whether the cpu went
>>>>>>>>> online/offline for the
>>>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash
>>>>>>>>> utility pulls that from
>>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>>
>>>>>>>>> I think this is what Sourabh has known, and why he has been
>>>>>>>>> advocating for an optimization
>>>>>>>>> path that allows not regenerating the elfcorehdr on cpu
>>>>>>>>> changes (because all the percpu
>>>>>>>>> structs are already laid out). I do think it best to leave that as
>>>>>>>>> an arch choice.
>>>>>>>>
>>>>>>>> Since things are clear on how the PT_NOTEs are consumed in the
>>>>>>>> kdump kernel [fs/proc/vmcore.c],
>>>>>>>> makedumpfile, and the crash tool, I need your opinion on this:
>>>>>>>>
>>>>>>>> Do we really need to regenerate the elfcorehdr for CPU hotplug events?
>>>>>>>> If yes, can you please list the elfcorehdr components that
>>>>>>>> change due to CPU hotplug?
>>>>>>> Due to the use of for_each_present_cpu(), it is possible for the
>>>>>>> number of cpu PT_NOTEs
>>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
>>>>>>> does not impact the
>>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>>
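For context, the loop in question looks approximately like this
(paraphrased from crash_prepare_elf64_headers() in kernel/crash_core.c;
exact code varies by kernel version), which is why present-but-offline
cpus still get PT_NOTEs while unplugged cpus do not:

        /* One PT_NOTE program header per present CPU. */
        for_each_present_cpu(cpu) {
                phdr->p_type = PT_NOTE;
                notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
                phdr->p_offset = phdr->p_paddr = notes_addr;
                phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
                (ehdr->e_phnum)++;
                phdr++;
        }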
>>>>>>>>
>>>>>>>> From what I understood, crash notes are prepared for possible
>>>>>>>> CPUs as the system boots and
>>>>>>>> could be used to create a PT_NOTE section for each possible CPU
>>>>>>>> while generating the elfcorehdr
>>>>>>>> during the kdump kernel load.
>>>>>>>>
>>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every
>>>>>>>> possible CPU there is no need to
>>>>>>>> regenerate it for CPU hotplug events. Or is there?
>>>>>>>
>>>>>>> For onlining/offlining of cpus, there is no need to regenerate
>>>>>>> the elfcorehdr. However,
>>>>>>> for actual hot un/plug of cpus, the answer is yes due to
>>>>>>> for_each_present_cpu(). The
>>>>>>> caveat here of course is that if the crash utility is the only
>>>>>>> coredump analyzer of concern,
>>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be
>>>>>>> no need to re-generate them.
>>>>>>>
>>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming
>>>>>>> into the mainstream, impacts
>>>>>>> any of this.
>>>>>>>
>>>>>>> Perhaps the one item that might help here is to distinguish
>>>>>>> between actual hot un/plug of
>>>>>>> cpus, versus onlining/offlining. At the moment, I cannot
>>>>>>> distinguish between a hot plug
>>>>>>> event and an online event (and unplug/offline). If those were
>>>>>>> distinguishable, then we
>>>>>>> could only regenerate on un/plug events.
>>>>>>>
>>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>>
>>>>>> Yes, because once the elfcorehdr is built with possible CPUs we don't
>>>>>> have to worry about
>>>>>> the hot[un]plug case.
>>>>>>
>>>>>> Here is my view on how things should be handled if a core-dump
>>>>>> analyzer is dependent on
>>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>>>
>>>>>> A PT_NOTE in the elfcorehdr holds the address of the corresponding
>>>>>> crash note (the kernel has
>>>>>> one crash note per CPU for every possible CPU). Though the crash
>>>>>> notes are allocated
>>>>>> at boot time, they are populated when the system is on the
>>>>>> crash path.
>>>>>>
>>>>>> This is how crash notes are populated on PowerPC, and I expect it
>>>>>> would be something
>>>>>> similar on other architectures too.
>>>>>>
>>>>>> The crashing CPU sends an IPI to every other online CPU with a
>>>>>> callback function that updates the
>>>>>> crash notes of that specific CPU. Once the IPI completes, the
>>>>>> crashing CPU updates its own crash
>>>>>> note and proceeds further.
>>>>>>
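A hedged sketch of that flow: crash_save_cpu() is the real kexec helper
that fills in a cpu's crash note, while the IPI plumbing names below are
illustrative only (the actual powerpc code lives in
arch/powerpc/kernel/crash.c):

#include <linux/kexec.h>
#include <linux/smp.h>

static void crash_ipi_callback_sketch(struct pt_regs *regs)
{
        /* Runs on each online CPU: fill in this cpu's own crash note. */
        crash_save_cpu(regs, smp_processor_id());
}

static void machine_crash_shutdown_sketch(struct pt_regs *regs)
{
        /* IPI every other online CPU to populate its crash note... */
        crash_send_ipi_sketch(crash_ipi_callback_sketch);  /* illustrative */
        /* ...then the crashing CPU updates its own note and proceeds. */
        crash_save_cpu(regs, smp_processor_id());
}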
>>>>>> The crash notes of CPUs remain uninitialized if the CPUs were
>>>>>> offline or hot unplugged at the time of the
>>>>>> system crash. The core-dump analyzer should be able to identify
>>>>>> [un]initialized crash notes
>>>>>> and display the information accordingly.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> - Sourabh
>>>>>
>>>>> I've been examining what it would mean to move to
>>>>> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think
>>>>> it means:
>>>>>
>>>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in
>>>>> crash_prepare_elf64_headers().
>>>>> - For kexec_load() syscall path, rewrite the incoming/supplied
>>>>> elfcorehdr immediately on the load with the elfcorehdr generated
>>>>> by crash_prepare_elf64_headers().
>>>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug
>>>>> events.
>>>>
>>>> If for_each_present_cpu() is replaced with for_each_possible_cpu() I
>>>> still need the cpuhp machinery
>>>> to update the FDT kexec segment for the CPU hot add case.
>>>
>>> Ah, ok, that's important! So the cpuhp callbacks are still needed.
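A sketch of what such an arch use of the callbacks could look like,
assuming the hp_action values introduced by this series;
update_fdt_cpus_sketch() is a hypothetical stand-in for a PowerPC FDT
update helper, not an existing function:

#include <linux/kexec.h>

void arch_crash_handle_hotplug_event(struct kimage *image)
{
        /* Refresh kexec segments other than the elfcorehdr, e.g. the FDT. */
        if (image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
            image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)
                update_fdt_cpus_sketch(image);  /* hypothetical FDT helper */
}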
>>>>
>>>>
>>>>>
>>>>> This would then set up PT_NOTEs for all possible cpus, which should
>>>>> in theory accommodate crash analyzers that rely on ELF PT_NOTEs
>>>>> for crash_notes.
>>>>>
>>>>> If staying with for_each_present_cpu() is ultimately decided, then
>>>>> I think leaving the cpuhp machinery in place and each arch could
>>>>> decide how to handle crash cpu hotplug events. The overhead for
>>>>> doing this is very minimal, and the events are likely very
>>>>> infrequent.
>>>>
>>>> I agree. Some architectures may need the cpuhp machinery to update
>>>> kexec segment[s] other than the elfcorehdr. For example, the FDT on PowerPC.
>>>>
>>>> - Sourabh Jain
>>>
>>> OK, I was thinking that the desire was to eliminate the cpuhp
>>> callbacks. In reality, the desire is to change to
>>> for_each_possible_cpu(). Given that the kernel creates crash_notes
>>> for all possible cpus upon kernel boot, there seems to be no reason
>>> to not do this?
>>>
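For reference, the boot-time allocation is a percpu one, so backing
storage exists for every possible CPU from boot onward. Paraphrased
from crash_notes_memory_init() in kernel/kexec_core.c (details may
differ by kernel version):

static int __init crash_notes_memory_init(void)
{
        size_t size = sizeof(note_buf_t);
        size_t align = min(roundup_pow_of_two(size), PAGE_SIZE);

        /* Percpu allocation: one crash note per possible CPU. */
        crash_notes = __alloc_percpu(size, align);
        if (!crash_notes) {
                pr_warn("Memory allocation for saving cpu register states failed\n");
                return -ENOMEM;
        }
        return 0;
}
subsys_initcall(crash_notes_memory_init);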
>>> HOWEVER...
>>>
>>> It's not clear to me that this particular change needs to be part of
>>> this series. Its inclusion would facilitate PPC support, but
>>> doesn't "solve" anything in general. In fact it causes kexec_load
>>> and kexec_file_load to deviate (kexec_load via userspace kexec does
>>> the equivalent of for_each_present_cpu() whereas with this change
>>> kexec_file_load would do for_each_possible_cpu(); until a hot plug
>>> event, after which both would do for_each_possible_cpu()). And if this
>>> change were to arrive as part of Sourabh's PPC support, then it does
>>> not appear to impact x86 (not sure about other arches). And the
>>> 'crash' dump analyzer doesn't care either way.
>>>
>>> Including this change would enable an optimization path (for x86 at
>>> least) that short-circuits cpu hotplug changes in the arch crash
>>> handler, for example:
>>>
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index aca3f1817674..0883f6b11de4 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
>>>      unsigned long mem, memsz;
>>>      unsigned long elfsz = 0;
>>>
>>> +    if (image->file_mode && (
>>> +        image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
>>> +        image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
>>> +        return;
>>> +
>>>      /*
>>>       * Create the new elfcorehdr reflecting the changes to CPU and/or
>>>       * memory resources.
>>>
>>> I'm not sure that is compelling given the infrequent nature of cpu
>>> hotplug events.
>> It certainly closes/reduces the window where kdump is not active due
>> to the kexec segment update.
>
> Fair enough. I plan to include this change in v19.
>
>>
>>>
>>> In my mind I still have a question about the kexec_load() path. The
>>> userspace kexec cannot do the equivalent of
>>> for_each_possible_cpu(). It can obtain the max possible cpus from
>>> /sys/devices/system/cpu/possible, but for those cpus not present the
>>> /sys/devices/system/cpu/cpuXX is not available and so the
>>> crash_notes entries are not available. My attempts to expose all
>>> cpuXX led to odd behavior that required changes in ACPI and
>>> arch code that looked untenable.
>>>
>>> There seem to be these options available for the kexec_load() path:
>>> - immediately rewrite the elfcorehdr upon load via a call to
>>> crash_prepare_elf64_headers(). I've made this work with the
>>> following, as proof of concept:
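A minimal illustrative sketch of that idea (not Eric's actual proof of
concept, which is elided here): regenerate the headers in-kernel via
crash_prepare_elf64_headers(), using its signature from recent kernels,
then copy them over the userspace-supplied segment. The *_sketch
helpers are hypothetical glue, not existing kernel functions:

static int crash_rewrite_elfcorehdr_sketch(struct kimage *image)
{
        void *elf_addr;
        unsigned long elf_sz;
        struct crash_mem *cmem;
        int ret;

        cmem = crash_mem_for_image_sketch(image);       /* hypothetical */
        if (!cmem)
                return -ENOMEM;

        /* Build fresh headers, same generator the kexec_file_load path uses. */
        ret = crash_prepare_elf64_headers(cmem, false, &elf_addr, &elf_sz);
        if (ret)
                return ret;

        /* Overwrite the userspace-supplied elfcorehdr segment. */
        return crash_copy_to_segment_sketch(image, elf_addr, elf_sz); /* hypothetical */
}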
>> Yes, regenerating/patching the elfcorehdr could be an option for the
>> kexec_load syscall.
> So this is not needed by x86, but more so by ppc. Should this change
> be in the ppc set or this set?
Since /sys/devices/system/cpu/cpuXX represents possible CPUs on PowerPC,
there is no need for elfcorehdr regeneration on PowerPC in the kexec_load case
for CPU hotplug events.

My ask is: keep the cpuhp machinery so that architectures can update
other kexec segments, if needed, for the CPU add/remove case.

In case x86 has nothing to update on CPU hotplug events and you want to
remove the CPU hp machinery, I can add the same
in the ppc patch series.
Thanks,
Sourabh Jain