[Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1

Wed Aug 27 01:17:28 PDT 2025

On Sat, Aug 23, 2025 at 11:00:11AM +0800, Coiby Xu wrote:
>Hi Marc,
>
>If I understand correctly, you want to reproduce the issue by yourself.
>I finally managed to reproduce this issue by playing with the setup
>shared by my colleague. Here are the five prerequisites to reproduce the
>bug,

Hi Marc,

It turns out that the host kernel and the host machine are not absolute
prerequisites for reproducing the problem. They still matter, though,
because they can make the problem much harder to reproduce. I also
bisected QEMU to find the commit that makes the issue go away. For
details, please see the inline comments below.

>
>1. Guest kernel
>   Newer than commit b5712bf89b4b
>   ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
>
>2. Host kernel
>   Relatively older ones like v6.10.0. Newer ones like v6.12.0 and
>   v6.17.0 don't have this issue.

It turns out that, with the other conditions met, the latest host kernel
(6.17.0-0.rc3) can still reproduce the issue, but it is much harder to
do so. For example, with the RHEL8 kernel 4.18.0-372.9.1.el8.aarch64, I
need to trigger a kernel crash at most 3 times to reproduce it. But with
the Fedora rawhide kernel 6.17.0-0.rc3.31.fc43.aarch64, in 3 out of 10
trials I couldn't reproduce the issue even after triggering a kernel
crash 60 consecutive times. For comparison, here is the number of kernel
crashes triggered in each of 10 trials (the three 60s for Fedora rawhide
are the trials where the issue did not reproduce),

RHEL8:           2  1  1  1  1  1  2  1  3  2
Fedora rawhide: 43 60 47 60 12 56 60 45 49 18
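
The two rows above can be summarized with a minimal Python sketch. It
assumes that a count of 60 means the trial was stopped without
reproducing the issue, so the Fedora rawhide median is only a lower
bound. The names CAP, trials and summarize are made up for illustration.

```python
from statistics import median

# Assumption: a trial that reached 60 crashes was stopped without
# reproducing the issue, so 60 is a cap, not a real hit count.
CAP = 60

# Crash counts per trial, copied from the two rows above.
trials = {
    "RHEL8": [2, 1, 1, 1, 1, 1, 2, 1, 3, 2],
    "Fedora rawhide": [43, 60, 47, 60, 12, 56, 60, 45, 49, 18],
}

def summarize(counts, cap=CAP):
    """Return the median count and how many trials hit the cap."""
    reproduced = [c for c in counts if c < cap]
    return {
        "median": median(counts),
        "not_reproduced": len(counts) - len(reproduced),
        "max_when_reproduced": max(reproduced),
    }

for name, counts in trials.items():
    print(name, summarize(counts))
```
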

>
>3. QEMU <= v6.2

I did a bisection, and it shows the issue is gone as of QEMU commit
f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39 ("kvm: Atomic memslot updates"),
which is the last (3rd) patch of the patch set "KVM: allow listener to
stop all vcpus before"
https://lists.nongnu.org/archive/html/qemu-devel/2022-11/msg02172.html
Note that this commit only appears in QEMU > 7.2, so QEMU <= v7.2.0 can
also reproduce this issue.

>
>4. Specific host machines
>   I'm not familiar with the hardware so currently I haven't figured out
>   what hardware factor makes the issue reproducible. I've attached
>   dmidecode outputs of four machines (files inside indmidecode_host folder).
>   Two systems (dmidecode_not_work*) can reproduce this issue and the
>   other two systems (dmidecode_work*) can't, despite all having the same
>   product name R152-P31-00, CPU model ARMv8 (M128-30) and SKU
>   01234567890123456789AB. One difference that doesn't seem to be found in
>   the dmidecode output is that the two machines that can't reproduce the
>   issue have the model name "PnP device PNP0c02" whereas the problematic
>   machines have "R152-P31-00 (01234567890123456789AB)" according to our
>   internal web pages that show the hardware info.

It turns out all four machines can reproduce the issue. I ran 10 trials
and counted how many kernel crashes were needed to reproduce the issue
in each one. Here's a comparison,

R152-P31-00:         2  1  1  1  1  1  2  1  3  2
PnP device PNP0c02:  8  3  5 15 11 18  2  5 12  4
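
To put a rough number on the difference, here is a minimal Python
sketch that estimates the chance of hitting the issue per kernel crash.
It assumes each crash is an independent attempt with a constant
probability (a geometric model, which is only an assumption here), so
the estimate is simply trials / total crashes. The names machines and
per_crash_rate are made up for illustration.

```python
# Crash counts per trial, copied from the two rows above.
machines = {
    "R152-P31-00": [2, 1, 1, 1, 1, 1, 2, 1, 3, 2],
    "PnP device PNP0c02": [8, 3, 5, 15, 11, 18, 2, 5, 12, 4],
}

def per_crash_rate(counts):
    """Geometric-model estimate: reproductions / total crashes triggered.

    Assumption: every crash has the same, independent chance of
    reproducing the issue. Every trial here ended in one reproduction.
    """
    return len(counts) / sum(counts)

for name, counts in machines.items():
    print(f"{name}: ~{per_crash_rate(counts):.2f} per crash")
```
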

>
>5. The Guest needs to be bridged to a physical host interface.
>   Bridging the guest to tun interface can't reproduce the issue (for
>   example, the default bridge (virbr0) created by libvirtd uses tun
>   interface)

I tried triggering a kernel crash 100 consecutive times with virbr0 in
one trial but couldn't reproduce it. So I think bridging the guest to a
physical network interface is still a must.

[...]

-- 
Best regards,
Coiby