kdump kernel randomly hang with tick_periodic call trace on bare metal system

Baoquan He bhe at redhat.com
Wed Dec 21 20:09:05 PST 2022


On 12/21/22 at 12:46pm, Guilherme G. Piccoli wrote:
> On 20/12/2022 02:51, Baoquan He wrote:
> > On 12/20/22 at 01:41pm, Baoquan He wrote:
> >> On one intel bare metal system, I can randomly reproduce the kdump hang
> >> as below with tick_periodic call trace. Attach the kernel config for
> >> reference.
> > 
> > Forgot mentioning this random hang is also caused by adding
> > 'nr_cpus=2' into normal kernel's cmdline, then triggering crash will get
> > kdump kernel hang as below kdump log shown.
> > 
> 
> The weird thing is that you seem to be using "nr_cpus=1" instead - this
> is the cmdline from the log:
> 
> "nr_cpus=2 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off
> numa=off udev.children-max=2 panic=10 acpi_no_memhotplug
> transparent_hugepage=never nokaslr hest_disable novmcoredd cma=0
> hugetlb_cma=0 disable_cpu_apicid=16 [...]"
> 
> You seems to pass twice the "nr_cpus" thing, and I guess kernel pick the
> last one?

>From the kdump kernel boot log, yes, the nr_cpus=1 is taken. The
parse_early_param() will parse the kernel parameters one by one, then
the last one will take effect. Here, the problem is not at nr_cpus=2 or
1, the bare metal system has 16 cpus, only 2 cpus is present, it seems
to be the halted 14 cpus get wrong message and behave incorrectly to
cause the issue.

> 
> Also, what is "disable_cpu_apicid=16"? Could this be related?

Not really. Please check disable_cpu_apicid in
Documentation/admin-guide/kdump/kdump.rst, it's bsp's apic id.




More information about the kexec mailing list