[PATCH v2] x86/sev: Fix host kdump support for SNP
Sean Christopherson
seanjc at google.com
Tue Sep 3 12:52:47 PDT 2024
On Tue, Sep 03, 2024, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra at amd.com>
> [ 1.671804] AMD-Vi: Using global IVHD EFR:0x841f77e022094ace, EFR2:0x0
> [ 1.679835] AMD-Vi: Translation is already enabled - trying to copy translation structures
> [ 1.689363] AMD-Vi: Copied DEV table from previous kernel.
> [ 1.864369] AMD-Vi: Completion-Wait loop timed out
> [ 2.038289] AMD-Vi: Completion-Wait loop timed out
> [ 2.212215] AMD-Vi: Completion-Wait loop timed out
> [ 2.386141] AMD-Vi: Completion-Wait loop timed out
> [ 2.560068] AMD-Vi: Completion-Wait loop timed out
> [ 2.733997] AMD-Vi: Completion-Wait loop timed out
> [ 2.907927] AMD-Vi: Completion-Wait loop timed out
> [ 3.081855] AMD-Vi: Completion-Wait loop timed out
> [ 3.225500] AMD-Vi: Completion-Wait loop timed out
> [ 3.231083] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> d out
> [ 3.579592] AMD-Vi: Completion-Wait loop timed out
> [ 3.753164] AMD-Vi: Completion-Wait loop timed out
> [ 3.815762] Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
> [ 3.825347] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc3-next-20240813-snp-host-f2a41ff576cc-dirty #61
> [ 3.837188] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM100AB 10/17/2022
> [ 3.846215] Call Trace:
> [ 3.848939] <TASK>
> [ 3.851277] dump_stack_lvl+0x2b/0x90
> [ 3.855354] dump_stack+0x14/0x20
> [ 3.859050] panic+0x3b9/0x400
> [ 3.862454] panic_if_irq_remap+0x21/0x30
> [ 3.866925] setup_IO_APIC+0x8aa/0xa50
> [ 3.871106] ? __pfx_amd_iommu_enable_faulting+0x10/0x10
> [ 3.877032] ? __cpuhp_setup_state+0x5e/0xd0
> [ 3.881793] apic_intr_mode_init+0x6a/0xf0
> [ 3.886360] x86_late_time_init+0x28/0x40
> [ 3.890832] start_kernel+0x6a8/0xb50
> [ 3.894914] x86_64_start_reservations+0x1c/0x30
> [ 3.900064] x86_64_start_kernel+0xbf/0x110
> [ 3.904729] ? setup_ghcb+0x12/0x130
> [ 3.908716] common_startup_64+0x13e/0x141
> [ 3.913283] </TASK>
> [ 3.915715] in panic
> [ 3.918149] in panic_other_cpus_shutdown
> [ 3.922523] ---[ end Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC ]---
>
> This happens as SNP_SHUTDOWN_EX fails
Exactly what happens? I.e. why does the PSP (sorry, ASP) need to be full shutdown
in order to get a kdump? The changelogs for the original SNP panic/kdump support
are frustratingly unhelpful. They all describe what the patch does, but say
nothing about _why_.
> when SNP VMs are active as the firmware checks every encryption-capable ASID
> to verify that it is not in use by a guest and a DF_FLUSH is not required. If
> a DF_FLUSH is required, the firmware returns DFFLUSH_REQUIRED.
>
> To fix this, added support to do SNP_DECOMMISSION of all active SNP VMs
> in the panic notifier before doing SNP_SHUTDOWN_EX, but then
> SNP_DECOMMISSION tags all CPUs on which guest has been activated to do
> a WBINVD. This causes SNP_DF_FLUSH command failure with the following
> flow: SNP_DECOMMISSION -> SNP_SHUTDOWN_EX -> SNP_DF_FLUSH ->
> failure with WBINVD_REQUIRED.
>
> When panic notifier is invoked all other CPUs have already been
> shutdown, so it is not possible to do a wbinvd_on_all_cpus() after
> SNP_DECOMMISSION has been executed. This eventually causes SNP_SHUTDOWN_EX
> to fail after SNP_DECOMMISSION.
>
> Adding fix to do SNP_DECOMMISSION and subsequent WBINVD on all CPUs
> during NMI shutdown of CPUs as part of disabling virtualization on
> all CPUs via cpu_emergency_disable_virtualization ->
> svm_emergency_disable().
>
> SNP_DECOMMISSION unbinds the ASID from SNP context and marks the ASID
> as unusable and then transitions the SNP guest context page to a
> firmware page and SNP_SHUTDOWN_EX transitions all pages associated
> with the IOMMU to reclaim state which the hypervisor then transitions
> to hypervisor state, all these page state changes are in the RMP
> table, so there is no loss of guest data as such and the complete
> host memory is captured with the crashkernel boot. There are no
> processes which are being killed and host/guest memory is not
> being altered or modified in any way.
>
> This fixes and enables crashkernel/kdump on SNP host.
...
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 714c517dd4b7..30f286a3afb0 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -16,6 +16,7 @@
> #include <linux/psp-sev.h>
> #include <linux/pagemap.h>
> #include <linux/swap.h>
> +#include <linux/delay.h>
> #include <linux/misc_cgroup.h>
> #include <linux/processor.h>
> #include <linux/trace_events.h>
> @@ -89,6 +90,8 @@ static unsigned int nr_asids;
> static unsigned long *sev_asid_bitmap;
> static unsigned long *sev_reclaim_asid_bitmap;
>
> +static DEFINE_SPINLOCK(snp_decommission_lock);
> +static void **snp_asid_to_gctx_pages_map;
> static int snp_decommission_context(struct kvm *kvm);
>
> struct enc_region {
> @@ -2248,6 +2251,9 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> goto e_free_context;
> }
>
> + if (snp_asid_to_gctx_pages_map)
> + snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = sev->snp_context;
> +
> return 0;
>
> e_free_context:
> @@ -2884,9 +2890,126 @@ static int snp_decommission_context(struct kvm *kvm)
> snp_free_firmware_page(sev->snp_context);
> sev->snp_context = NULL;
>
> + if (snp_asid_to_gctx_pages_map)
> + snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = NULL;
> +
> return 0;
> }
>
> +static void __snp_decommission_all(void)
> +{
> + struct sev_data_snp_addr data = {};
> + int ret, asid;
> +
> + if (!snp_asid_to_gctx_pages_map)
> + return;
> +
> + for (asid = 1; asid < min_sev_asid; asid++) {
> + if (snp_asid_to_gctx_pages_map[asid]) {
> + data.address = __sme_pa(snp_asid_to_gctx_pages_map[asid]);
NULL pointer deref if this races with snp_decommission_context() from task
context.
> + ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
> + if (!ret) {
And what happens if SEV_CMD_SNP_DECOMMISSION fails?
> + snp_free_firmware_page(snp_asid_to_gctx_pages_map[asid]);
> + snp_asid_to_gctx_pages_map[asid] = NULL;
> + }
> + }
> + }
> +}
> +
> +/*
> + * NOTE: called in NMI context from svm_emergency_disable().
> + */
> +void sev_emergency_disable(void)
> +{
> + static atomic_t waiting_for_cpus_synchronized;
> + static bool synchronize_cpus_initiated;
> + static bool snp_decommission_handled;
> + static atomic_t cpus_synchronized;
> +
> + if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> + return;
> +
> + /*
> + * SNP_SHUTDOWN_EX fails when SNP VMs are active as the firmware checks
Define "active".
> + * every encryption-capable ASID to verify that it is not in use by a
> + * guest and a DF_FLUSH is not required. If a DF_FLUSH is required,
> + * the firmware returns DFFLUSH_REQUIRED. To address this, SNP_DECOMMISSION
> + * is required to shutdown all active SNP VMs, but SNP_DECOMMISSION tags all
> + * CPUs that guest was activated on to do a WBINVD. When panic notifier
> + * is invoked all other CPUs have already been shutdown, so it is not
> + * possible to do a wbinvd_on_all_cpus() after SNP_DECOMMISSION has been
> + * executed. This eventually causes SNP_SHUTDOWN_EX to fail after
> + * SNP_DECOMMISSION. To fix this, do SNP_DECOMMISSION and subsequent WBINVD
> + * on all CPUs during NMI shutdown of CPUs as part of disabling
> + * virtualization on all CPUs via cpu_emergency_disable_virtualization().
> + */
> +
> + spin_lock(&snp_decommission_lock);
> +
> + /*
> + * exit early for call from native_machine_crash_shutdown()
> + * as SNP_DECOMMISSION has already been done as part of
> + * NMI shutdown of the CPUs.
> + */
> + if (snp_decommission_handled) {
> + spin_unlock(&snp_decommission_lock);
> + return;
> + }
> +
> + /*
> + * Synchronize all CPUs handling NMI before issuing
> + * SNP_DECOMMISSION.
> + */
> + if (!synchronize_cpus_initiated) {
> + /*
> + * one CPU handling panic, the other CPU is initiator for
> + * CPU synchronization.
> + */
> + atomic_set(&waiting_for_cpus_synchronized, num_online_cpus() - 2);
And what happens when num_online_cpus() == 1?
> + synchronize_cpus_initiated = true;
> + /*
> + * Ensure CPU synchronization parameters are setup before dropping
> + * the lock to let other CPUs continue to reach synchronization.
> + */
> + wmb();
> +
> + spin_unlock(&snp_decommission_lock);
> +
> + /*
> + * This will not cause system to hang forever as the CPU
> + * handling panic waits for maximum one second for
> + * other CPUs to stop in nmi_shootdown_cpus().
> + */
> + while (atomic_read(&waiting_for_cpus_synchronized) > 0)
> + mdelay(1);
> +
> + /* Reacquire the lock once CPUs are synchronized */
> + spin_lock(&snp_decommission_lock);
> +
> + atomic_set(&cpus_synchronized, 1);
> + } else {
> + atomic_dec(&waiting_for_cpus_synchronized);
> + /*
> + * drop the lock to let other CPUs contiune to reach
> + * synchronization.
> + */
> + spin_unlock(&snp_decommission_lock);
> +
> + while (atomic_read(&cpus_synchronized) == 0)
> + mdelay(1);
> +
> + /* Try to re-acquire lock after CPUs are synchronized */
> + spin_lock(&snp_decommission_lock);
> + }
Yeah, no. This is horrific. If the panic path doesn't provide the necessary
infrastructure to ensure the necessary ordering between the initiating CPU and
responding CPUs, then rework the panic path.
More information about the kexec
mailing list