[PATCH v2] x86/sev: Fix host kdump support for SNP

Tue Sep 3 12:52:47 PDT 2024

On Tue, Sep 03, 2024, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra at amd.com>
> [    1.671804] AMD-Vi: Using global IVHD EFR:0x841f77e022094ace, EFR2:0x0
> [    1.679835] AMD-Vi: Translation is already enabled - trying to copy translation structures
> [    1.689363] AMD-Vi: Copied DEV table from previous kernel.
> [    1.864369] AMD-Vi: Completion-Wait loop timed out
> [    2.038289] AMD-Vi: Completion-Wait loop timed out
> [    2.212215] AMD-Vi: Completion-Wait loop timed out
> [    2.386141] AMD-Vi: Completion-Wait loop timed out
> [    2.560068] AMD-Vi: Completion-Wait loop timed out
> [    2.733997] AMD-Vi: Completion-Wait loop timed out
> [    2.907927] AMD-Vi: Completion-Wait loop timed out
> [    3.081855] AMD-Vi: Completion-Wait loop timed out
> [    3.225500] AMD-Vi: Completion-Wait loop timed out
> [    3.231083] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> d out
> [    3.579592] AMD-Vi: Completion-Wait loop timed out
> [    3.753164] AMD-Vi: Completion-Wait loop timed out
> [    3.815762] Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
> [    3.825347] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc3-next-20240813-snp-host-f2a41ff576cc-dirty #61
> [    3.837188] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM100AB 10/17/2022
> [    3.846215] Call Trace:
> [    3.848939]  <TASK>
> [    3.851277]  dump_stack_lvl+0x2b/0x90
> [    3.855354]  dump_stack+0x14/0x20
> [    3.859050]  panic+0x3b9/0x400
> [    3.862454]  panic_if_irq_remap+0x21/0x30
> [    3.866925]  setup_IO_APIC+0x8aa/0xa50
> [    3.871106]  ? __pfx_amd_iommu_enable_faulting+0x10/0x10
> [    3.877032]  ? __cpuhp_setup_state+0x5e/0xd0
> [    3.881793]  apic_intr_mode_init+0x6a/0xf0
> [    3.886360]  x86_late_time_init+0x28/0x40
> [    3.890832]  start_kernel+0x6a8/0xb50
> [    3.894914]  x86_64_start_reservations+0x1c/0x30
> [    3.900064]  x86_64_start_kernel+0xbf/0x110
> [    3.904729]  ? setup_ghcb+0x12/0x130
> [    3.908716]  common_startup_64+0x13e/0x141
> [    3.913283]  </TASK>
> [    3.915715] in panic
> [    3.918149] in panic_other_cpus_shutdown
> [    3.922523] ---[ end Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC ]---
> 
> This happens as SNP_SHUTDOWN_EX fails

Exactly what happens?  I.e. why does the PSP (sorry, ASP) need to be full shutdown
in order to get a kdump?  The changelogs for the original SNP panic/kdump support
are frustratingly unhelpful.  They all describe what the patch does, but say
nothing about _why_.

> when SNP VMs are active as the firmware checks every encryption-capable ASID
> to verify that it is not in use by a guest and a DF_FLUSH is not required. If
> a DF_FLUSH is required, the firmware returns DFFLUSH_REQUIRED.
> 
> To fix this, added support to do SNP_DECOMMISSION of all active SNP VMs
> in the panic notifier before doing SNP_SHUTDOWN_EX, but then
> SNP_DECOMMISSION tags all CPUs on which guest has been activated to do
> a WBINVD. This causes SNP_DF_FLUSH command failure with the following
> flow: SNP_DECOMMISSION -> SNP_SHUTDOWN_EX -> SNP_DF_FLUSH ->
> failure with WBINVD_REQUIRED.
> 
> When panic notifier is invoked all other CPUs have already been
> shutdown, so it is not possible to do a wbinvd_on_all_cpus() after
> SNP_DECOMMISSION has been executed. This eventually causes SNP_SHUTDOWN_EX
> to fail after SNP_DECOMMISSION.
> 
> Adding fix to do SNP_DECOMMISSION and subsequent WBINVD on all CPUs
> during NMI shutdown of CPUs as part of disabling virtualization on
> all CPUs via cpu_emergency_disable_virtualization ->
> svm_emergency_disable().
> 
> SNP_DECOMMISSION unbinds the ASID from SNP context and marks the ASID
> as unusable and then transitions the SNP guest context page to a
> firmware page and SNP_SHUTDOWN_EX transitions all pages associated
> with the IOMMU to reclaim state which the hypervisor then transitions
> to hypervisor state, all these page state changes are in the RMP
> table, so there is no loss of guest data as such and the complete
> host memory is captured with the crashkernel boot. There are no
> processes which are being killed and host/guest memory is not
> being altered or modified in any way.
> 
> This fixes and enables crashkernel/kdump on SNP host.

...

> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 714c517dd4b7..30f286a3afb0 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -16,6 +16,7 @@
>  #include <linux/psp-sev.h>
>  #include <linux/pagemap.h>
>  #include <linux/swap.h>
> +#include <linux/delay.h>
>  #include <linux/misc_cgroup.h>
>  #include <linux/processor.h>
>  #include <linux/trace_events.h>
> @@ -89,6 +90,8 @@ static unsigned int nr_asids;
>  static unsigned long *sev_asid_bitmap;
>  static unsigned long *sev_reclaim_asid_bitmap;
>  
> +static DEFINE_SPINLOCK(snp_decommission_lock);
> +static void **snp_asid_to_gctx_pages_map;
>  static int snp_decommission_context(struct kvm *kvm);
>  
>  struct enc_region {
> @@ -2248,6 +2251,9 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  		goto e_free_context;
>  	}
>  
> +	if (snp_asid_to_gctx_pages_map)
> +		snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = sev->snp_context;
> +
>  	return 0;
>  
>  e_free_context:
> @@ -2884,9 +2890,126 @@ static int snp_decommission_context(struct kvm *kvm)
>  	snp_free_firmware_page(sev->snp_context);
>  	sev->snp_context = NULL;
>  
> +	if (snp_asid_to_gctx_pages_map)
> +		snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = NULL;
> +
>  	return 0;
>  }
>  
> +static void __snp_decommission_all(void)
> +{
> +	struct sev_data_snp_addr data = {};
> +	int ret, asid;
> +
> +	if (!snp_asid_to_gctx_pages_map)
> +		return;
> +
> +	for (asid = 1; asid < min_sev_asid; asid++) {
> +		if (snp_asid_to_gctx_pages_map[asid]) {
> +			data.address = __sme_pa(snp_asid_to_gctx_pages_map[asid]);

NULL pointer deref if this races with snp_decommission_context() from task
context.

> +			ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
> +			if (!ret) {

And what happens if SEV_CMD_SNP_DECOMMISSION fails?

> +				snp_free_firmware_page(snp_asid_to_gctx_pages_map[asid]);
> +				snp_asid_to_gctx_pages_map[asid] = NULL;
> +			}
> +		}
> +	}
> +}
> +
> +/*
> + * NOTE: called in NMI context from svm_emergency_disable().
> + */
> +void sev_emergency_disable(void)
> +{
> +	static atomic_t waiting_for_cpus_synchronized;
> +	static bool synchronize_cpus_initiated;
> +	static bool snp_decommission_handled;
> +	static atomic_t cpus_synchronized;
> +
> +	if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> +		return;
> +
> +	/*
> +	 * SNP_SHUTDOWN_EX fails when SNP VMs are active as the firmware checks

Define "active".

> +	 * every encryption-capable ASID to verify that it is not in use by a
> +	 * guest and a DF_FLUSH is not required. If a DF_FLUSH is required,
> +	 * the firmware returns DFFLUSH_REQUIRED. To address this, SNP_DECOMMISSION
> +	 * is required to shutdown all active SNP VMs, but SNP_DECOMMISSION tags all
> +	 * CPUs that guest was activated on to do a WBINVD. When panic notifier
> +	 * is invoked all other CPUs have already been shutdown, so it is not
> +	 * possible to do a wbinvd_on_all_cpus() after SNP_DECOMMISSION has been
> +	 * executed. This eventually causes SNP_SHUTDOWN_EX to fail after
> +	 * SNP_DECOMMISSION. To fix this, do SNP_DECOMMISSION and subsequent WBINVD
> +	 * on all CPUs during NMI shutdown of CPUs as part of disabling
> +	 * virtualization on all CPUs via cpu_emergency_disable_virtualization().
> +	 */
> +
> +	spin_lock(&snp_decommission_lock);
> +
> +	/*
> +	 * exit early for call from native_machine_crash_shutdown()
> +	 * as SNP_DECOMMISSION has already been done as part of
> +	 * NMI shutdown of the CPUs.
> +	 */
> +	if (snp_decommission_handled) {
> +		spin_unlock(&snp_decommission_lock);
> +		return;
> +	}
> +
> +	/*
> +	 * Synchronize all CPUs handling NMI before issuing
> +	 * SNP_DECOMMISSION.
> +	 */
> +	if (!synchronize_cpus_initiated) {
> +		/*
> +		 * one CPU handling panic, the other CPU is initiator for
> +		 * CPU synchronization.
> +		 */
> +		atomic_set(&waiting_for_cpus_synchronized, num_online_cpus() - 2);

And what happens when num_online_cpus() == 1?

> +		synchronize_cpus_initiated = true;
> +		/*
> +		 * Ensure CPU synchronization parameters are setup before dropping
> +		 * the lock to let other CPUs continue to reach synchronization.
> +		 */
> +		wmb();
> +
> +		spin_unlock(&snp_decommission_lock);
> +
> +		/*
> +		 * This will not cause system to hang forever as the CPU
> +		 * handling panic waits for maximum one second for
> +		 * other CPUs to stop in nmi_shootdown_cpus().
> +		 */
> +		while (atomic_read(&waiting_for_cpus_synchronized) > 0)
> +		       mdelay(1);
> +
> +		/* Reacquire the lock once CPUs are synchronized */
> +		spin_lock(&snp_decommission_lock);
> +
> +		atomic_set(&cpus_synchronized, 1);
> +	} else {
> +		atomic_dec(&waiting_for_cpus_synchronized);
> +		/*
> +		 * drop the lock to let other CPUs contiune to reach
> +		 * synchronization.
> +		 */
> +		spin_unlock(&snp_decommission_lock);
> +
> +		while (atomic_read(&cpus_synchronized) == 0)
> +		       mdelay(1);
> +
> +		/* Try to re-acquire lock after CPUs are synchronized */
> +		spin_lock(&snp_decommission_lock);
> +	}

Yeah, no.  This is horrific.  If the panic path doesn't provide the necessary
infrastructure to ensure the necessary ordering between the initiating CPU and
responding CPUs, then rework the panic path.