[Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
Coiby Xu
coxu at redhat.com
Wed Aug 27 01:17:28 PDT 2025
On Sat, Aug 23, 2025 at 11:00:11AM +0800, Coiby Xu wrote:
>Hi Marc,
>
>If I understand correctly, you want to reproduce the issue yourself.
>I finally managed to reproduce this issue by playing with the setup
>shared by my colleague. Here are the five prerequisites to reproduce the
>bug,
Hi Marc,
It turns out the host kernel and the host machine are not absolute
prerequisites for reproducing the problem. They still matter, though,
because they can make it much more difficult to reproduce. I also bisected
QEMU to find out which commit makes the issue go away. For details, please
see the inline comments below.
>
>1. Guest kernel newer than commit b5712bf89b4b
>("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
>
>2. Host kernel
> Relatively older ones like v6.10.0. Newer ones like v6.12.0 and
> v6.17.0 don't have this issue.
It turns out that, with the other conditions met, the latest host kernel
(6.17.0-0.rc3) can still reproduce the issue, but it's much more
difficult. For example, with RHEL8 kernel 4.18.0-372.9.1.el8.aarch64, I
need to trigger a kernel crash at most 3 times to reproduce it. But with
Fedora rawhide kernel 6.17.0-0.rc3.31.fc43.aarch64, in 3 out of 10 trials
I couldn't reproduce the issue even after triggering 60 consecutive
kernel crashes. For comparison, here is the number of kernel crashes
needed to reproduce the issue in each of 10 trials:
RHEL8: 2 1 1 1 1 1 2 1 3 2
Fedora rawhide: 43 60 47 60 12 56 60 45 49 18
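In case it helps to compare the two host kernels at a glance, here is a
minimal Python sketch that summarizes the counts above. It assumes each
trial was capped at 60 crashes and that a count of 60 means the issue was
not reproduced in that trial (which matches the 3 out of 10 trials
mentioned above). The names MAX_CRASHES, trials and summarize are only
illustrative.

```python
from statistics import mean, median

# Assumption: each trial stops after 60 crashes, and a count of 60 means
# the issue was NOT reproduced in that trial.
MAX_CRASHES = 60

# Crash counts per trial, copied from the two lists above.
trials = {
    "RHEL8": [2, 1, 1, 1, 1, 1, 2, 1, 3, 2],
    "Fedora rawhide": [43, 60, 47, 60, 12, 56, 60, 45, 49, 18],
}


def summarize(counts, cap=MAX_CRASHES):
    """Summarize crash counts, treating trials that hit the cap as failures."""
    reproduced = [c for c in counts if c < cap]
    return {
        "trials": len(counts),
        "not_reproduced": len(counts) - len(reproduced),
        # Mean and median only cover trials where the issue was reproduced.
        "mean": mean(reproduced) if reproduced else None,
        "median": median(reproduced) if reproduced else None,
    }


if __name__ == "__main__":
    for name, counts in trials.items():
        print(name, summarize(counts))
```

Since the capped trials are excluded from the mean and median, the Fedora
rawhide numbers understate how hard it is to reproduce the issue there.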
>
>3. QEMU <= v6.2
I did a bisection and it shows the issue is gone with QEMU commit
f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39 ("kvm: Atomic memslot updates"),
which is the last (3rd) patch of the patch set "KVM: allow listener to
stop all vcpus before"
https://lists.nongnu.org/archive/html/qemu-devel/2022-11/msg02172.html
Note this commit is only present in QEMU > 7.2, so QEMU <= v7.2.0 can
also reproduce this issue.
>
>4. Specific host machines
> I'm not familiar with the hardware so currently I haven't figured out
> what hardware factor makes the issue reproducible. I've attached
> dmidecode outputs of four machines (files inside indmidecode_host folder).
> Two systems (dmidecode_not_work*) can reproduce this issue and the
> other two systems (dmidecode_work*) can't, despite all having the same
> product name R152-P31-00, CPU model ARMv8 (M128-30) and SKU
> 01234567890123456789AB. One difference that doesn't seem to be found in
> the dmidecode output is that the two machines that can't reproduce the
> issue have the model name "PnP device PNP0c02" whereas the problematic
> machines have "R152-P31-00 (01234567890123456789AB)" according to our
> internal web pages that show the hardware info.
It turns out all four machines can reproduce the issue. I ran 10 trials
and counted the number of kernel crashes needed to reproduce it each
time. Here's a comparison:
R152-P31-00: 2 1 1 1 1 1 2 1 3 2
PnP device PNP0c02: 8 3 5 15 11 18 2 5 12 4
>
>5. The guest needs to be bridged to a physical host interface.
>Bridging the guest to a tun interface doesn't reproduce the issue (for
> example, the default bridge (virbr0) created by libvirtd uses a tun
> interface)
In one trial with virbr0 I triggered 100 consecutive kernel crashes but
couldn't reproduce it. So I think bridging the guest to a physical
network interface is still a must.
[...]
--
Best regards,
Coiby