[Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1

Coiby Xu coxu at redhat.com
Fri Aug 22 20:00:11 PDT 2025


Hi Marc,

If I understand correctly, you want to reproduce the issue yourself.
I have finally managed to reproduce it by experimenting with the setup
shared by my colleague. Here are the five prerequisites for reproducing
the bug:

1. Guest kernel 
    Newer than commit b5712bf89b4b ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")

2. Host kernel
    Relatively older ones like v6.10.0. Newer ones like v6.12.0 and
    v6.17.0 don't have this issue.

3. QEMU <= v6.2

4. Specific host machines 
    I'm not familiar with the hardware, so I haven't yet figured out
    which hardware factor makes the issue reproducible. I've attached
    dmidecode outputs of four machines (files inside the dmidecode_host
    folder). Two systems (dmidecode_not_work*) can reproduce this issue and
    the other two (dmidecode_work*) can't, even though all four have the
    same product name R152-P31-00, CPU model ARMv8 (M128-30) and SKU
    01234567890123456789AB. One difference, which doesn't seem to show up
    in the dmidecode output, is that the two machines that can't reproduce
    the issue have the model name "PnP device PNP0c02" whereas the
    problematic machines have "R152-P31-00 (01234567890123456789AB)",
    according to our internal web pages that show the hardware info.

5. The guest needs to be bridged to a physical host interface.
    Bridging the guest to a tun interface doesn't reproduce the issue (for
    example, the default bridge (virbr0) created by libvirtd uses a tun
    interface).
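
For prerequisite 5, whether a given bridge has a physical uplink can be
checked from sysfs on the host. The following is only a rough sketch, not
something taken from the attached logs: classify_bridge_ports is a
hypothetical helper name, and it assumes that tun/tap ports can be
recognised by the tun_flags attribute that the tun driver exposes under
/sys/class/net/<iface>/. Note that the guest's own tap device always shows
up as a port; what matters is whether a non-tun port is listed as well.

```shell
# Classify each port of a bridge as "tun/tap" or "physical/other".
# Arguments: bridge name, optional sysfs net root (hypothetical override,
# only there so the helper can be tried against a fake directory tree).
classify_bridge_ports() {
    bridge="$1"
    net_root="${2:-/sys/class/net}"
    for port in "$net_root/$bridge"/brif/*; do
        # Skip the unexpanded glob when the bridge has no ports.
        [ -e "$port" ] || continue
        name="${port##*/}"
        # Assumption: tun/tap devices expose a tun_flags attribute.
        if [ -e "$net_root/$name/tun_flags" ]; then
            echo "$name: tun/tap"
        else
            echo "$name: physical/other"
        fi
    done
}
```

On the host this would be run as "classify_bridge_ports br0" and
"classify_bridge_ports virbr0" to compare the two bridges.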

With the above conditions met, I can reproduce the issue with just the
Fedora Cloud Base 42 image:

1. Start the VM
    qemu-system-aarch64 -cpu host   -machine virt \
    -device virtio-net-pci,netdev=hn0,id=nic1,mac=00:16:3e:3d:5f:b8 \
    -netdev bridge,id=hn0,br=br0,helper=/usr/local/libexec/qemu-bridge-helper \
    -hda /var/lib/libvirt/images/f42_1.qcow2 \
    -accel kvm -boot d \
    -drive if=pflash,format=raw,readonly,file=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw \
    -m 35840 -serial stdio -smp 16

2. Set up kdump to dump vmcore to a remote NFS server
    dnf install kdump-utils nfs-utils -y
    echo nfs NFS_SERVER:EXPORT_PATH >> /etc/kdump.conf
    systemctl enable kdump
    kdumpctl reset-crashkernel 
    systemctl reboot

3. After rebooting, trigger a crash of the 1st kernel
    If kdump works, i.e. DHCP works, you will need to trigger the kernel
    crash again until it doesn't. In my experience, repeating this step 6
    consecutive times is enough to hit at least one run where DHCP doesn't
    work.
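
One common way to trigger the crash in step 3 is the sysrq "c" trigger.
Below is a minimal sketch, assuming sysrq is available in the guest. The
function names are hypothetical, and the check relies on
/sys/kernel/kexec_crash_loaded reading 1 once the crash kernel has been
loaded by the kdump service.

```shell
# Returns 0 if a crash (kdump) kernel is loaded. The optional argument
# overrides the sysfs path and exists only to make the helper testable.
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ "$(cat "$state_file" 2>/dev/null)" = "1" ]
}

# Panics the running kernel via sysrq. Only call this inside the test
# guest, as root: the machine goes down immediately.
trigger_crash() {
    crash_kernel_loaded || { echo "crash kernel not loaded" >&2; return 1; }
    echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
    echo c > /proc/sysrq-trigger      # force a crash, kdump takes over
}
```

Calling trigger_crash after each reboot of the guest repeats step 3.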

Note that f42_1.qcow2 was created from the Fedora Cloud Base 42 image:
https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/aarch64/images/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2

Considering QEMU 6.2 was released about 4 years ago, do you think there
is a need to dig further into this problem to find out how the five
prerequisites interact to create the bug? If you think it's worth the
effort, I'll bisect QEMU to find the first bad commit and also provide
any other debugging info you need.

On Wed, Aug 20, 2025 at 09:56:50AM +0100, Marc Zyngier wrote:
>On Wed, 20 Aug 2025 00:30:12 +0100,
>Coiby Xu <coxu at redhat.com> wrote:
>>
>> On Wed, Aug 13, 2025 at 08:08:28PM +0800, Coiby Xu wrote:
>> > On Tue, Aug 12, 2025 at 02:14:25PM +0100, Marc Zyngier wrote:
>> [...]
>> >>
>> >> Can you at the very least share:
>>
>> Thanks for your patience! I've attached a zip file with the info you
>> need. Additionally I've included the dmidecode of guest
>> (dmidecode_guest), host machine (dmidecode_host) and the domain info
>> of guest (libvirt.xml) in case they may be helpful. If you need further
>> info or any experiment I need to do, feel free to let me know! Now I
>> have access to the host machine so I can respond much faster.
>>
>> >>
>> >> - the boot log of the guest on its first kernel
>>
>> Please check file boot_log_1st_kernel
>
>Old kernel. It would have been better to use a vanilla v6.16, so that
>we know exactly what you are running. I have zero interest in finding
>out what 6.15.9-201.fc42.aarch64 corresponds to in real life.

Thanks for the suggestion! I've built v6.16 and attached the logs.
Please check 04_not_work/boot_log_{1st,2nd}_kernel.

Btw, I'm curious to know why you want a vanilla v6.16. Is it because you
are worried a Fedora kernel can be so different from a vanilla v6.16
that it can obscure the problem?

>
>> >> - the boot log of the guest running kdump
>>
>> boot_log_2nd_kernel
>
>Same thing.
>
>>
>> >>
>> >> - the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when
>> >> running both kernels
>>
>> vgic-state_{1st,2nd}_kernel
>
>What is the host running? It also looks like a pre-6.16 kernel, which
>lacks important information.

The host is running RHEL 8.6. I can also confirm that a host running
Fedora kernel 6.10.0-64.fc41.aarch64 reproduces the issue, but newer
ones like 6.17.0-0.rc2.24.fc43.aarch64 do not.

>
>>
>> >>
>> >> - the QEMU command-line to get to run the whole thing
>>
>> qemu_cmdline
>
>I'm sorry, but that doesn't look like a command line as I know it. I
>certainly cannot feed this to QEMU and reproduce your findings.

Sorry, I didn't realize you wanted to reproduce the issue. At that point
I hadn't reproduced it myself and thought it would not be easy to
reproduce, so I merely shared the cmdline generated by
libvirt/virt-install in case you could spot something suspicious in it.

>
>Now, there is *one* thing that is interesting:
>
>The second vgic_state dump indicates that LPI 8225 is routed to
>vcpu-3. Given that your guest boots into the second kernel on vcpu-0,
>and that this is the only online vcpu at this stage, the LPI will
>never be presented to the CPU (and the vgic has it as pending, which
>is what I'd expect).
>
>I'd suggest you instrument the second kernel to try and see why this
>affinity is not changed.

I'm not familiar with interrupts yet. But I notice that in the 2nd
kernel, all /proc/irq/*/smp_affinity files have the same value 1 and
/proc/interrupts only lists one CPU. If you want me to try other
things, please let me know.
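
The affinity check described above can be sketched as a small shell
helper. This is only an illustration: dump_irq_affinity is a hypothetical
name, and the effective_affinity file is printed only when the running
kernel exports it under /proc/irq/<N>/.

```shell
# Print "irq <N>: smp_affinity=<mask> effective=<mask>" for every IRQ.
# The optional argument overrides /proc (hypothetical, for testing only).
dump_irq_affinity() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/irq/[0-9]*; do
        [ -r "$dir/smp_affinity" ] || continue
        irq="${dir##*/}"
        mask="$(cat "$dir/smp_affinity")"
        # effective_affinity is only present on kernels that export it.
        if [ -r "$dir/effective_affinity" ]; then
            eff="$(cat "$dir/effective_affinity")"
        else
            eff="n/a"
        fi
        echo "irq $irq: smp_affinity=$mask effective=$eff"
    done
}
```

Running dump_irq_affinity in the 2nd kernel shows the guest's view of each
IRQ, which can then be compared with the host-side vgic state dump.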

>
>Thanks,
>
>	M.
>
>-- 
>Jazz isn't dead. It just smells funny.
>

-- 
Best regards,
Coiby
-------------- next part --------------
A non-text attachment was scrubbed...
Name: debug_info_VGICv3_not_work_for_kdump.zip
Type: application/zip
Size: 111284 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/kexec/attachments/20250823/1c2ad9d2/attachment-0001.zip>

