[PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
Eric DeVolder
eric.devolder at oracle.com
Tue Feb 28 10:52:52 PST 2023
On 2/28/23 06:44, Baoquan He wrote:
> On 02/13/23 at 10:10am, Sourabh Jain wrote:
>>
>> On 11/02/23 06:05, Eric DeVolder wrote:
>>>
>>>
>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>
>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>> Hello Eric,
>>>>>>
>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>> Eric!
>>>>>>>>
>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>
>>>>>>>>> So my latest solution is to introduce two new CPUHP
>>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for
>>>>>>>>> offlining. I'm open to better names.
>>>>>>>>>
>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be
>>>>>>>>> placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>> attempts at locating this state failed when
>>>>>>>>> inside the STARTING section, so I located
>>>>>>>>> this just inside the ONLINE section. The crash
>>>>>>>>> hotplug handler is registered on
>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>
>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be
>>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>> placed it at the end of the PREPARE section.
>>>>>>>>> This crash hotplug handler is also
>>>>>>>>> registered on this state as the callback for the .teardown method.
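(For illustration only: a registration along these lines, where the two state names are the ones
mentioned above but the handler names and the crash_update_elfcorehdr() helper are hypothetical
placeholders, might look like this.)

	static int crash_cpu_online(unsigned int cpu)
	{
		crash_update_elfcorehdr(cpu);	/* hypothetical: refresh the elfcorehdr */
		return 0;
	}

	static int crash_cpu_offline(unsigned int cpu)
	{
		crash_update_elfcorehdr(cpu);
		return 0;
	}

	static int __init crash_hotplug_init(void)
	{
		int ret;

		/* .startup callback, runs after CPUHP_BRINGUP_CPU on the way up */
		ret = cpuhp_setup_state(CPUHP_AP_ELFCOREHDR_ONLINE,
					"crash/elfcorehdr:online",
					crash_cpu_online, NULL);
		if (ret < 0)
			return ret;

		/* .teardown callback, runs before CPUHP_TEARDOWN_CPU on the way down */
		return cpuhp_setup_state(CPUHP_BP_ELFCOREHDR_OFFLINE,
					 "crash/elfcorehdr:offline",
					 NULL, crash_cpu_offline);
	}
	subsys_initcall(crash_hotplug_init);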
>>>>>>>>
>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>
>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>> {
>>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>
>>>>>>>> return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>>> }
>>>>>>>>
>>>>>>>> and use this to query the actual state at crash
>>>>>>>> time. That spares all
>>>>>>>> those callback heuristics.
>>>>>>>>
>>>>>>>>> I'm making my way though percpu crash_notes,
>>>>>>>>> elfcorehdr, vmcoreinfo,
>>>>>>>>> makedumpfile and (the consumer of it all) the
>>>>>>>>> userspace crash utility,
>>>>>>>>> in order to understand the impact of moving from
>>>>>>>>> for_each_present_cpu()
>>>>>>>>> to for_each_online_cpu().
>>>>>>>>
>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> tglx
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Thomas,
>>>>>>> I've investigated the passing of crash notes through the
>>>>>>> vmcore. What I've learned is that:
>>>>>>>
>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references
>>>>>>> to do its job) does
>>>>>>> not care what the contents of cpu PT_NOTES are, but it
>>>>>>> does coalesce them together.
>>>>>>>
>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in
>>>>>>> order to determine its
>>>>>>> nr_cpus variable, which is reported in a header, but
>>>>>>> otherwise unused (except
>>>>>>> for sadump method).
>>>>>>>
>>>>>>> - the crash utility, for the purposes of determining the
>>>>>>> cpus, does not appear to
>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>> cpu_[possible|present|online]_mask and computes
>>>>>>> nr_cpus from that, and also of
>>>>>>> course which are online. In addition, when crash does
>>>>>>> reference the cpu PT_NOTE,
>>>>>>> to get its prstatus, it does so by using a percpu
>>>>>>> technique directly in the vmcore
>>>>>>> image memory, not via the ELF structure. Said
>>>>>>> differently, it appears to me that
>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for
>>>>>>> cpus; rather it obtains them
>>>>>>> via kernel cpumasks and the memory within the vmcore.
>>>>>>>
>>>>>>> With this understanding, I did some testing. Perhaps the
>>>>>>> most telling test was that I
>>>>>>> changed the number of cpu PT_NOTEs emitted in the
>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>> hot plugged some cpus, then also took a few offline
>>>>>>> sparsely via chcpu, then generated a
>>>>>>> vmcore. The crash utility had no problem loading the
>>>>>>> vmcore, it reported the proper number
>>>>>>> of cpus and the number offline (despite only one cpu
>>>>>>> PT_NOTE), and after changing to a different
>>>>>>> cpu via 'set -c 30' the backtrace was completely valid.
>>>>>>>
>>>>>>> My takeaway is that the crash utility does not rely upon
>>>>>>> ELF cpu PT_NOTEs; it obtains the
>>>>>>> cpu information directly from kernel data structures.
>>>>>>> Perhaps at one time crash relied
>>>>>>> upon the ELF information, but no more. (Perhaps there
>>>>>>> are other crash dump analyzers
>>>>>>> that might rely on the ELF info?)
>>>>>>>
>>>>>>> So, all this to say that I see no need to change
>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>> is no compelling reason to move away from
>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>> online/offline.
>>>>>>>
>>>>>>> Which then leaves the topic of the cpuhp state on which
>>>>>>> to register. Perhaps reverting
>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right
>>>>>>> answer. There does not appear to
>>>>>>> be a compelling need to accurately track whether the cpu
>>>>>>> went online/offline for the
>>>>>>> purposes of creating the elfcorehdr, as ultimately the
>>>>>>> crash utility pulls that from
>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>
>>>>>>> I think this is what Sourabh has known and has been
>>>>>>> advocating for: an optimization
>>>>>>> path that avoids regenerating the elfcorehdr on cpu
>>>>>>> changes (because all the percpu
>>>>>>> structs are already laid out). I do think it best to leave
>>>>>>> that as an arch choice.
>>>>>>
>>>>>> Since things are clear on how the PT_NOTES are consumed in
>>>>>> kdump kernel [fs/proc/vmcore.c],
>>>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>>>
>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>> If yes, can you please list the elfcorehdr components that
>>>>>> changes due to CPU hotplug.
>>>>> Due to the use of for_each_present_cpu(), it is possible for the
>>>>> number of cpu PT_NOTEs
>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
>>>>> does not impact the
>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>
>>>>>>
>>>>>> From what I understood, crash notes are prepared for
>>>>>> possible CPUs as the system boots and
>>>>>> could be used to create a PT_NOTE section for each possible
>>>>>> CPU while generating the elfcorehdr
>>>>>> during the kdump kernel load.
>>>>>>
>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every
>>>>>> possible CPU there is no need to
>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>
>>>>> For onlining/offlining of cpus, there is no need to regenerate
>>>>> the elfcorehdr. However,
>>>>> for actual hot un/plug of cpus, the answer is yes due to
>>>>> for_each_present_cpu(). The
>>>>> caveat here of course is that if crash utility is the only
>>>>> coredump analyzer of concern,
>>>>> then it doesn't care about these cpu PT_NOTEs and there would be
>>>>> no need to re-generate them.
>>>>>
>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming
>>>>> into mainstream, impacts
>>>>> any of this.
>>>>>
>>>>> Perhaps the one item that might help here is to distinguish
>>>>> between actual hot un/plug of
>>>>> cpus, versus onlining/offlining. At the moment, I can not
>>>>> distinguish between a hot plug
>>>>> event and an online event (and unplug/offline). If those were
>>>>> distinguishable, then we
>>>>> could only regenerate on un/plug events.
>>>>>
>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>
>>>> Yes, because once elfcorehdr is built with possible CPUs we don't
>>>> have to worry about
>>>> the hot[un]plug case.
>>>>
>>>> Here is my view on how things should be handled if a core-dump
>>>> analyzer is dependent on
>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>
>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash
>>>> notes (kernel has
>>>> one crash note per CPU for every possible CPU). Though the crash
>>>> notes are allocated
>>>> during boot time, they are populated when the system is on the
>>>> crash path.
>>>>
>>>> This is how crash notes are populated on PowerPC and I am expecting
>>>> it would be something
>>>> similar on other architectures too.
>>>>
>>>> The crashing CPU sends an IPI to every other online CPU with a callback
>>>> function that updates the
>>>> crash notes of that specific CPU. Once the IPI completes the
>>>> crashing CPU updates its own crash
>>>> note and proceeds further.
>>>>
>>>> The crash notes of CPUs remain uninitialized if the CPUs were
>>>> offline or hot unplugged at the time of the
>>>> system crash. The core-dump analyzer should be able to identify
>>>> [un]/initialized crash notes
>>>> and display the information accordingly.
>>>>
>>>> Thoughts?
>>>>
>>>> - Sourabh
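(A purely illustrative sketch of that last point, not taken from makedumpfile or crash: since the
per-cpu crash note buffers are zero-filled when allocated, an analyzer could treat an all-zero ELF
note header as "this cpu never wrote its note".)

	#include <elf.h>
	#include <stdbool.h>
	#include <string.h>

	/*
	 * Sketch only: decide whether a per-cpu crash note was ever written.
	 * Assumes the note buffer was zero-initialized at allocation and that
	 * a populated note begins with a "CORE"/NT_PRSTATUS ELF note header.
	 */
	static bool crash_note_is_initialized(const void *buf)
	{
		const Elf64_Nhdr *nhdr = buf;

		/* Still all zeroes: cpu was offline or unplugged at crash time */
		if (!nhdr->n_namesz && !nhdr->n_descsz && !nhdr->n_type)
			return false;

		return nhdr->n_type == NT_PRSTATUS &&
		       !memcmp((const char *)(nhdr + 1), "CORE", 5);
	}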
>>>
>>> In general, I agree with your points. You've presented a strong case to
>>> go with for_each_possible_cpu() in crash_prepare_elf64_headers() and
>>> those crash notes would always be present, and we can ignore changes to
>>> cpus wrt/ elfcorehdr updates.
>>>
>>> But what do we do about kexec_load() syscall? The way the userspace
>>> utility works is it determines cpus by:
>>> nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
>>> which is not the equivalent of possible_cpus. So the complete list of
>>> cpu PT_NOTEs is not generated up front. We would need a solution for
>>> that?
>> Hello Eric,
>>
>> The sysconf document says _SC_NPROCESSORS_CONF is processors configured;
>> isn't that equivalent to possible CPUs?
>>
>> What exactly does sysconf(_SC_NPROCESSORS_CONF) return on x86? IIUC, on PowerPC
>> it is possible CPUs.
>
Baoquan,
> From the sysconf man page, my understanding is that _SC_NPROCESSORS_CONF
> returns the possible cpus, while _SC_NPROCESSORS_ONLN returns present
> cpus. If these are true, we can use them.
Thomas Gleixner has pointed out that:
  glibc tries to evaluate that in the following order:
   1) /sys/devices/system/cpu/cpu*
      That's present CPUs not possible CPUs
   2) /proc/stat
      That's online CPUs
   3) sched_getaffinity()
      That's online CPUs at best. In the worst case it's an affinity mask
      which is set on a process group
meaning that _SC_NPROCESSORS_CONF is not equivalent to possible_cpus(). Furthermore, the
/sys/devices/system/cpu/cpuXX entries are not available for not-present-but-possible cpus; thus the
userspace kexec utility cannot write out the elfcorehdr with all possible cpus listed.
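A quick way to see the difference (a standalone sketch, not part of kexec-tools; it just compares
what glibc reports against the kernel's possible mask):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char possible[64] = "";
		FILE *f = fopen("/sys/devices/system/cpu/possible", "r");

		if (f) {
			fgets(possible, sizeof(possible), f);
			fclose(f);
		}

		/* glibc-derived counts vs. the kernel's possible cpu mask */
		printf("_SC_NPROCESSORS_CONF: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
		printf("_SC_NPROCESSORS_ONLN: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
		printf("possible cpu mask   : %s", possible);
		return 0;
	}

On a system with possible-but-not-present cpus the two sysconf values will not match the possible
mask, which is the gap described above.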
>
> But I am wondering why the existing present cpu way is going to be
> discarded. Sorry, I tried to go through this thread but it's too long; can
> anyone summarize the reason in shorter, clearer sentences? Sorry
> again for that.
Utilizing for_each_possible_cpu() in crash_prepare_elf64_headers() would, in the case of
kexec_file_load(), simplify some issues Sourabh has encountered for PPC support.
It would also enable an optimization that permits NOT re-generating the elfcorehdr on cpu changes,
as all the [possible] cpus are already described in the elfcorehdr.
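For reference, the relevant loop in crash_prepare_elf64_headers() would change roughly as follows
(a sketch modeled on the mainline loop; treat it as illustrative, not the actual patch):

	/* One PT_NOTE phdr per *possible* cpu instead of per present cpu */
	for_each_possible_cpu(cpu) {
		phdr->p_type = PT_NOTE;
		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
		phdr->p_offset = phdr->p_paddr = notes_addr;
		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
		(ehdr->e_phnum)++;
		phdr++;
	}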
I've pointed out that this change would initially have kexec_load deviate from kexec_file_load:
kexec-tools can only write out the present cpus, whereas kexec_file_load would now write out the
possible cpus. The deviation would disappear after the first hotplug event, since that regenerates
the elfcorehdr via crash_prepare_elf64_headers(). Alternatively, I've provided a simple way for
kexec_load to rewrite its elfcorehdr upon initial load, by calling into the crash hotplug handler.
Can you think of any side effects of going to for_each_possible_cpu()?
Thanks,
eric
>
>>
>> In case sysconf(_SC_NPROCESSORS_CONF) is not consistent, then we can go with
>> /sys/devices/system/cpu/possible for the kexec_load case.
>>
>> Thoughts?
>>
>> - Sourabh Jain
>>
>