[PATCH v17 3/6] crash: add generic infrastructure for crash hotplug support

Fri Jan 20 13:24:32 PST 2023

On 1/19/23 15:31, Thomas Gleixner wrote:
> Eric!
> 
> On Wed, Jan 18 2023 at 16:35, Eric DeVolder wrote:
>> CPU and memory change notifications are received in order to
>> regenerate the elfcorehdr.
>>
>> To support cpu hotplug, a callback is registered to capture the
>> CPUHP_AP_ONLINE_DYN online and offline events via
>> cpuhp_setup_state_nocalls().
> 
> This sentence does not make sense. The callback is not registered to
> capture CPUHP_AP_ONLINE_DYN events >
> What this does is: It installs a dynamic CPU hotplug state with
> callbacks for online and offline. These callbacks store information
> about a CPU coming up and going down. Right?

I agree, the wording is wrong; this code taps into that state, as you suggest, in order to handle 
the online and offline events.

> 
> But why are they required and what's the value?
> 
> This changelog tells WHAT it does and not WHY. I can see the WHAT from
> the patch itself.
> 
> Don't tell me the WHY is in the cover letter. The cover letter is not
> part of the commits and changelogs have to be self contained.
> 
> Now let me cite from your cover letter:
> 
>> When the kdump service is loaded, if a CPU or memory is hot
>> un/plugged, the crash elfcorehdr, which describes the CPUs
>> and memory in the system, must also be updated, else the resulting
>> vmcore is inaccurate (eg. missing either CPU context or memory
>> regions).
I'll work to improve the wording and why for the next iteration.

> 
> The CPU hotplug state you are using for this is patently inaccurate
> too. With your approach the CPU is tracked as online very late in the
> hotplug process and tracked as offline very early on unplug.
> 
> So if the kernel crashes before/after the plug/unplug tracking event
> then your recorded state is bogus and given the amount of callbacks
> between the real online/offline and the recording point there is a
> pretty large window.
> 
> You can argue that this is better than the current state and considered
> good enough for whatever reason, but such information wants to be in the
> changelog, no?
I agree!  I admit that CPUHP_AP_ONLINE_DYN may (is) not the best choice. I did spend time looking at 
the cpu hotplug infrastructure, but did not learn a better/correct way. Fwiw: 
https://lore.kernel.org/lkml/20211118174948.37435-1-eric.devolder@oracle.com/:

"The second problem is the use of CPUHP_AP_ONLINE_DYN.  The
cpuhp_setup_state_nocalls() is invoked with parameter
CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
the CPU still shows up in foreach_present_cpu() during the
regeneration of the elfcorehdr, thus the need to explicitly check and
exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
Perhaps if value(s) new/different than CPUHP_AP_ONLINE_DYN to
cpuhp_setup_state() was utilized, then the offline cpu would no longer
be in foreach_present_cpu(), and this change could be eliminated. I do
not understand cpuhp_setup_state() well enough to choose, or create,
appropriate value(s)."

The problem described (and worked around in this patch series) is the behavior/window you point out. 
I'd prefer to narrow the window, if possible. The states/values I tried did not work; any 
suggestions for a more appropriate state/value would be most welcomed!

> 
> Thanks,
> 
>          tglx
> 
> Hint: The requirements for changelogs are well documented in Documentation/process/
> 
> 
Thomas, thank you for looking at this!
eric