[PATCH v2] x86/sev: Fix host kdump support for SNP

Wed Sep 4 12:54:08 PDT 2024

On Wed, Sep 04, 2024 at 12:37:17PM -0500, Kalra, Ashish wrote:
> Hello Paolo,
> 
> On 9/4/2024 5:29 AM, Paolo Bonzini wrote:
> > On 9/4/24 00:58, Kalra, Ashish wrote:
> >> The issue here is that panic path will ensure that all (other) CPUs
> >> have been shutdown via NMI by checking that they have executed the
> >> NMI shutdown callback.
> >>
> >> But the above synchronization is specifically required for SNP case,
> >> as we don't want to execute the SNP_DECOMMISSION command (to destroy
> >> SNP guest context) while one or more CPUs are still in the NMI VMEXIT
> >> path and still in the process of saving the vCPU state (and still
> >> modifying SNP guest context?) during this VMEXIT path. Therefore, we
> >> ensure that all the CPUs have saved the vCPU state and entered NMI
> >> context before issuing SNP_DECOMMISSION. The point is that this is a
> >> specific SNP requirement (and that's why this specific handling in
> >> sev_emergency_disable()) and i don't know how we will be able to
> >> enforce it in the generic panic path ?
> >
> > I think a simple way to do this is to _first_ kick out other
> > CPUs through NMI, and then the one that is executing
> > emergency_reboot_disable_virtualization().  This also makes
> > emergency_reboot_disable_virtualization() and
> > native_machine_crash_shutdown() more similar, in that
> > the latter already stops other CPUs before disabling
> > virtualization on the one that orchestrates the shutdown.
> >
> > Something like (incomplete, it has to also add the bool argument
> > to cpu_emergency_virt_callback and the actual callbacks):
> >
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index 340af8155658..3df25fbe969d 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -111,7 +111,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >  
> >      crash_smp_send_stop();
> >  
> > -    cpu_emergency_disable_virtualization();
> > +    cpu_emergency_disable_virtualization(true);
> >  
> >      /*
> >       * Disable Intel PT to stop its logging
> > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> > index 0e0a4cf6b5eb..7a86ec786987 100644
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -558,7 +558,7 @@ EXPORT_SYMBOL_GPL(cpu_emergency_unregister_virt_callback);
> >   * reboot.  VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if
> >   * GIF=0, i.e. if the crash occurred between CLGI and STGI.
> >   */
> > -void cpu_emergency_disable_virtualization(void)
> > +void cpu_emergency_disable_virtualization(bool last)
> >  {
> >      cpu_emergency_virt_cb *callback;
> >  
> > @@ -572,7 +572,7 @@ void cpu_emergency_disable_virtualization(void)
> >      rcu_read_lock();
> >      callback = rcu_dereference(cpu_emergency_virt_callback);
> >      if (callback)
> > -        callback();
> > +        callback(last);
> >      rcu_read_unlock();
> >  }
> >  
> > @@ -591,11 +591,11 @@ static void emergency_reboot_disable_virtualization(void)
> >       * other CPUs may have virtualization enabled.
> >       */
> >      if (rcu_access_pointer(cpu_emergency_virt_callback)) {
> > -        /* Safely force _this_ CPU out of VMX/SVM operation. */
> > -        cpu_emergency_disable_virtualization();
> > -
> >          /* Disable VMX/SVM and halt on other CPUs. */
> >          nmi_shootdown_cpus_on_restart();
> > +
> > +        /* Safely force _this_ CPU out of VMX/SVM operation. */
> > +        cpu_emergency_disable_virtualization(true);
> >      }
> >  }
> >  #else
> > @@ -877,7 +877,7 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
> >       * Prepare the CPU for reboot _after_ invoking the callback so that the
> >       * callback can safely use virtualization instructions, e.g. VMCLEAR.
> >       */
> > -    cpu_emergency_disable_virtualization();
> > +    cpu_emergency_disable_virtualization(false);
> >  
> >      atomic_dec(&waiting_for_crash_ipi);
> >  
> > diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> > index 18266cc3d98c..9a863348d1a7 100644
> > --- a/arch/x86/kernel/smp.c
> > +++ b/arch/x86/kernel/smp.c
> > @@ -124,7 +124,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
> >      if (raw_smp_processor_id() == atomic_read(&stopping_cpu))
> >          return NMI_HANDLED;
> >  
> > -    cpu_emergency_disable_virtualization();
> > +    cpu_emergency_disable_virtualization(false);
> >      stop_this_cpu(NULL);
> >  
> >      return NMI_HANDLED;
> > @@ -136,7 +136,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
> >  DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
> >  {
> >      apic_eoi();
> > -    cpu_emergency_disable_virtualization();
> > +    cpu_emergency_disable_virtualization(false);
> >      stop_this_cpu(NULL);
> >  }
> >  
> >
> > And then a second patch adds sev_emergency_disable() and only
> > executes it if last == true.
> >
> This implementation will not work as we need to do wbinvd on all other CPUs after SNP_DECOMMISSION has been issued.
> 
> When the last CPU executes sev_emergency_disable() and issues SNP_DECOMMISSION, by that time all other CPUs have entered the NMI halt loop and then they won't be able to do a wbinvd and hence SNP_SHUTDOWN will eventually fail.
> 
> One way this can work is if all other CPUs can still execute sev_emergency_disable() and in case of last == false, do a spin busy till the last cpu kicks them out of the spin loop and then they do a wbinvd after exiting the spin busy, but then cpu_emergency_disable_virtualization() is/has to be called before atomic_dec(&waiting_for_crash_ipi) in crash_nmi_callback(), so this spin busy in other CPUs will force the last CPU to wait forever (or till it times out waiting for all other CPUs to execute the NMI callback) which makes all of this looks quite fragile.

Hi Ashish,

Your patch (and Paolo's suggested rework) came up in the PUCK call this
morning and I mentioned this point. I was asked to raise some of the
points here so we can continue the discussion on-list:

Have we confirmed that WBINVD actually has to happen after the
SNP_DECOMISSION call? Or do we just need to ensure that the WBINVD was
issued after the last VMEXIT, and that the guest will never VMENTER
again after the WBINVD?

Because if WBINVD can be done prior to SNP_DECOMISSION, then Paolo was
suggesting we could just have:

  kvm_cpu_svm_disable() {
    ...
    WBINVD
  }

  cpu_emergency_disable_virtualization() {
    kvm_cpu_svm_disable()
  }

  smp_stop_nmi_callback() {
    if (raw_smp_processor_id() == atomic_read(&stopping_cpu)) {
      return NMI_HANDLED;
    }
    cpu_emergency_disable_virtualization()
    return NMI_HANDLED
  }

The panic'ing CPU can then:

  1) WBINVD itself (e.g. via cpu_emergency_disable_virtualization())
  2) issue SNP_DECOMMISSION for all ASIDs

That would avoid much of the additional synchronization handling since it
fits more cleanly into existing panic flow. But it's contingent on
whether WBINVD can happen before SNP_DECOMMISION or not so need to
confirm that.

Sean and Paolo also raised some other points (feel free to add if I
missed anything):

  - Since there's a lot of high-level design aspects that need to be
    accounted for, it would be good to have the patch have some sort of
    pseudocode/graph/etc so it's easier to reason about different
    approaches. It would also be good to include this in the commit
    message (generally it's encouraged to describe "why" rather than "how",
    but this is an exceptional case where both are useful to reader).

  - Sean inquired about making the target kdump kernel more agnostic to
    whether or not SNP_SHUTDOWN was done properly, since that might
    allow for capturing state even for edge cases where we can't go
    through the normal cleanup path. I mentioned we'd tried this to some
    degree but hit issues with the IOMMU, and when working around that
    there was another issue but I don't quite recall the specifics.
    Can you post a quick recap of what the issues are with that approach
    so we can determine whether or not this is still an option?

-Mike

> 
> Thanks, Ashish
>