[PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

Fri Feb 10 16:35:09 PST 2023

On 2/10/23 00:29, Sourabh Jain wrote:
> 
> On 10/02/23 01:09, Eric DeVolder wrote:
>>
>>
>> On 2/9/23 12:43, Sourabh Jain wrote:
>>> Hello Eric,
>>>
>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>> Eric!
>>>>>
>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>
>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>
>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on
>>>>>> this state as the callback for the .startup method.
>>>>>>
>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>> registered on this state as the callback for the .teardown method.
>>>>>
>>>>> TBH, that's still overengineered. Something like this:
>>>>>
>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>> {
>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>
>>>>>     return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>> }
>>>>>
>>>>> and use this to query the actual state at crash time. That spares all
>>>>> those callback heuristics.
>>>>>
>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>> to for_each_online_cpu().
>>>>>
>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>          tglx
>>>>>
>>>>>
>>>>
>>>> Thomas,
>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>
>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>
>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>   for sadump method).
>>>>
>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>
>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number
>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different
>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>
>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>> that might rely on the ELF info?)
>>>>
>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>> online/offline.
>>>>
>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>> kernel data structures, not the elfcorehdr.
>>>>
>>>> I think this is what Sourabh has known and has been advocating for an optimization
>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu
>>>> structs are all laid out). I do think it best to leave that as an arch choice.
>>>
>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
>>> makedumpfile, and crash tool I need your opinion on this:
>>>
>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>> If yes, can you please list the elfcorehdr components that changes due to CPU hotplug.
>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>> number of cpu PT_NOTEs (as the cpus are still present).
>>
>>>
>>>  From what I understood, crash notes are prepared for possible CPUs as system boots and
>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
>>> during the kdump kernel load.
>>>
>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>> regenerate it for CPU hotplug events. Or do we?
>>
>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>
>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>> any of this.
>>
>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>> event and an online event (and unplug/offline). If those were distinguishable, then we
>> could only regenerate on un/plug events.
>>
>> Or perhaps moving to for_each_possible_cpu() is the better choice?
> 
> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about
> hot[un]plug case.
> 
> Here is my view on how things should be handled if a core-dump analyzer is dependent on
> elfcorehdr PT_NOTEs to find online/offline CPUs.
> 
> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has
> one crash note per CPU for every possible CPU). Though the crash notes are allocated
> during the boot time they are populated when the system is on the crash path.
> 
> This is how crash notes are populated on PowerPC and I am expecting it would be something
> similar on other architectures too.
> 
> The crashing CPU sends IPI to every other online CPU with a callback function that updates the
> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash
> note and proceeds further.
> 
> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time
> system crash. The core-dump analyzer should be able to identify [un]/initialized crash notes
> and display the information accordingly.
> 
> Thoughts?
> 
> - Sourabh

In general, I agree with your points. You've presented a strong case to go with 
for_each_possible_cpu() in crash_prepare_elf64_headers() and those crash notes would always be 
present, and we can ignore changes to cpus wrt/ elfcorehdr updates.

But what do we do about kexec_load() syscall? The way the userspace utility works is it determines 
cpus by:
  nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
which is not the equivalent of possible_cpus. So the complete list of cpu PT_NOTEs is not generated 
up front. We would need a solution for that?

Thanks,
eric

PS. I'll be on vacation all of next week, returning 20feb.