[PATCH] kexec: do syscore_shutdown() in kernel_kexec
Gowans, James
jgowans at amazon.com
Mon Dec 18 23:41:52 PST 2023
On Tue, 2023-12-19 at 12:22 +0800, Baoquan He wrote:
> Add Andrew to CC as Andrew helps to pick kexec/kdump patches.
Ah, thanks, I didn't realise that Andrew pulls in the kexec patches.
>
> On 12/13/23 at 08:40am, James Gowans wrote:
> ......
> > This has been tested by doing a kexec on x86_64 and aarch64.
>
> Hi James,
>
> Thanks for this great patch. My colleagues have opened a bug in RHEL
> to track this and are trying to verify this patch. However, they
> can't reproduce the issue this patch is fixing. Could you tell us
> more about where and how to reproduce it, so that we can understand
> it better? Thanks in advance.
Sure! The TL;DR is: run a VMX (Intel x86) KVM VM on Linux v6.4+ and do a
kexec while the KVM VM is still running. Before this patch the system
will triple fault.
In more detail:
Run a bare metal host on a modern Intel CPU with VMX support. The kernel
I was using was 6.7.0-rc5+.
You can totally do this with a QEMU "host" as well, btw; that's how I
did the debugging, attaching GDB to it to figure out what was up.
If you want a virtual "host", launch with:

  -cpu host -M q35,kernel-irqchip=split,accel=kvm -enable-kvm
Launch a KVM guest VM, e.g.:

  qemu-system-x86_64 \
    -enable-kvm \
    -cdrom alpine-virt-3.19.0-x86_64.iso \
    -nodefaults -nographic -M q35 \
    -serial mon:stdio
While the guest VM is *still running*, do a kexec on the host, e.g.:

  kexec -l --reuse-cmdline --initrd=config-6.7.0-rc5+ vmlinuz-6.7.0-rc5+ && \
    kexec -e
The kexec target can be anything, but I generally just kexec to the
same kernel/ramdisk as is currently running, i.e. a same-version
kexec.

Before this patch the kexec will get stuck; after it, the kexec goes
smoothly and the system ends up in the new kernel within a few
seconds. I hope those steps are clear and that you can repro this.
BTW, the reason it's important for the KVM VM to still be running
when the host does the kexec is that KVM internally maintains a usage
counter and disables virtualisation once all VMs have been
terminated, via:
  __fput(kvm_fd)
    kvm_vm_release
      kvm_destroy_vm
        hardware_disable_all
          hardware_disable_all_nolock
            kvm_usage_count--;
            if (!kvm_usage_count)
                    on_each_cpu(hardware_disable_nolock, NULL, 1);
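
In other words, roughly (a simplified sketch of that logic, not the
exact KVM code; locking and the arch-specific details are elided, see
virt/kvm/kvm_main.c for the real implementation):

  /* One usage count shared by all VMs on the host; virtualisation
   * is only disabled once the last VM goes away. */
  static int kvm_usage_count;

  static void hardware_disable_nolock(void *junk)
  {
          /* Arch hook: on Intel this ends up doing VMXOFF and
           * clearing CR4.VMXE on the current CPU. */
  }

  static void hardware_disable_all_nolock(void)
  {
          kvm_usage_count--;
          if (!kvm_usage_count)
                  on_each_cpu(hardware_disable_nolock, NULL, 1);
  }
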
So if all KVM fds are closed then kexec will work, because VMXE is
cleared on all CPUs when the last VM is destroyed. If the KVM fds are
still open (i.e. the QEMU process still exists) then the issue
manifests. It may sound nasty to do a kexec while QEMU processes are
still around, but this is a perfectly normal flow for live update:

1. Pause and serialise VM state.
2. kexec.
3. Deserialise and resume VMs.

In that flow there's no need to actually kill the QEMU process; as
long as the VM is *paused* and has been serialised, we can happily
kexec.
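
For completeness: the reason this only bites on v6.4+ is that KVM
moved its "disable virtualisation on shutdown" work from a reboot
notifier to a syscore_ops shutdown callback, and kernel_kexec() never
ran the syscore shutdown hooks. A simplified sketch of what this
patch changes in kernel_kexec() (kernel/kexec_core.c; the surrounding
code and error handling are elided):

  int kernel_kexec(void)
  {
          ...
          kexec_in_progress = true;
          kernel_restart_prepare("kexec reboot");
          migrate_to_reboot_cpu();
          syscore_shutdown();     /* the fix: runs KVM's syscore
                                     shutdown hook, which does VMXOFF
                                     on every CPU */
          machine_shutdown();
          machine_kexec(kexec_image);
          ...
  }
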
JG