[RFC PATCH] Bug during kexec...not all cpus are stopped

Alok Kataria akataria at vmware.com
Mon Oct 11 15:41:23 EDT 2010


Hi Eric,

Thanks for taking a look.

On Mon, 2010-10-11 at 11:07 -0700, Eric W. Biederman wrote:
> Alok Kataria <akataria at vmware.com> writes:
> 
> > Copying a few more maintainers on the thread...
> > On Fri, 2010-10-08 at 13:34 -0700, Alok Kataria wrote:
> >> Before starting the new kernel, kexec calls machine_shutdown to stop all
> >> the cpus, which internally calls native_smp_send_stop. AFAIU, kexec
> >> expects that all the cpus are halted after that call returns.
> >> Now, looking at the code for native_smp_send_stop, it assumes that all
> >> the processors have processed the REBOOT IPI within 1 second of the IPI
> >> being sent.
> >> native_smp_send_stop()
> >> ---------------------------------------------------------
> >>      apic->send_IPI_allbutself(REBOOT_VECTOR);
> >>
> >>         /* Don't wait longer than a second */
> >>         wait = USEC_PER_SEC;
> >>         while (num_online_cpus() > 1 && wait--)
> >>              udelay(1);
> >> ---------------------------------------------------------
> >>
> >> It just returns after that 1 second irrespective of whether all cpus
> >> have halted or not. This brings up an issue in the kexec case, since we
> >> can have the BSP starting the new kernel while the APs are still
> >> processing the REBOOT IPI.
> >>
> >> Many distribution kernels use kexec to load the newly installed kernel
> >> during the installation phase. In a virtualized environment with the host
> >> heavily overcommitted, we have seen instances where a vcpu fails to
> >> process the IPI in the allotted 1 sec; as a result the APs end up
> >> accessing uninitialized state (the BSP has already gone ahead with
> >> setting up the new state), causing GPFs.
> >>
> >> IMO, kexec expects machine_shutdown to return only after all cpus are
> >> stopped.
> >>
> >> The patch below should fix the issue, comments ??
> >>
> >> --
> >> machine_shutdown now takes a parameter "wait"; if it is true, it waits
> >> until all the cpus are halted. All the callers except kexec still fall
> >> back to the earlier version of the shutdown call, which just waited for
> >> a maximum of 1 sec before returning.
> 
> The approach seems reasonable.  However on every path except for the
> panic path I would wait for all of the cpus to stop, as that is what
> those code paths are expecting.  Until last year smp_send_stop always
> waited until all of the cpus stopped.  So I think changing
> machine_shutdown should not need to happen.
> 
> This patch cannot be used as is because it changes the parameters to
> smp_send_stop() and there are more than just x86 implementations out
> there.
> 
> Let me propose a slightly different tactic.  We need to separate
> the panic path from the normal path to avoid confusion.  At the
> generic level smp_send_stop is exclusively used for the panic
> path.  So we should keep that, and rename the x86 non-panic path
> function something else.
> 
> Can you rename the x86 smp_send_stop function to, say, stop_all_other_cpus()?
> 
> At which point we could implement smp_send_stop as:
> stop_all_other_cpus(0);
> 
> And everywhere else would call stop_all_other_cpus(1) waiting for
> the cpus to actually stop.

Done. I have added a stop_other_cpus() wrapper which calls
smp_ops.stop_other_cpus(1).
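
To make the split explicit, the two entry points after this patch look
roughly like this (reconstructed from the smp.h hunk below; the caller
mapping is a summary, not part of the diff):

	/* panic()/kdump path: send the IPI, best effort, don't insist */
	static inline void smp_send_stop(void)
	{
		smp_ops.stop_other_cpus(0);
	}

	/* shutdown/kexec path: don't return until the other cpus are stopped */
	static inline void stop_other_cpus(void)
	{
		smp_ops.stop_other_cpus(1);
	}

so panic() keeps the old best-effort behaviour, while
native_machine_shutdown() and xen_reboot() now wait.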

> 
> I really think we want to wait for the cpus to stop in all cases except
> for panic/kdump where we expect things to be broken and we are doing
> our best to make things work anyway.
Now it does, except in the panic case.
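
For reference, with the smp.c hunk below applied, the wait loop in
native_stop_other_cpus() ends up as:

	timeout = USEC_PER_SEC;
	while (num_online_cpus() > 1 && (wait || timeout--))
		udelay(1);

i.e. with wait == 0 we still give up after roughly a second, and with
wait != 0 we keep spinning until every other cpu has marked itself
offline.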

Jeremy, I have modified the xen_reboot bits too so that it now waits
until all cpus are actually stopped; please take a look.

--

x86 smp_ops now has a new op, stop_other_cpus, which takes a parameter
"wait". This allows the caller to specify whether it wants to wait until
all the cpus have processed the stop IPI. This is required specifically
for the kexec case, where we should wait for all the cpus to be stopped
before starting the new kernel.
We now wait for the cpus to stop in all cases except for panic/kdump,
where we expect things to be broken and we are doing our best to make
things work anyway.


Signed-off-by: Alok N Kataria <akataria at vmware.com>
Cc: Eric W. Biederman <ebiederm at xmission.com>
Cc: Jeremy Fitzhardinge <jeremy at xensource.com>

Index: linux-2.6/arch/x86/include/asm/smp.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/smp.h	2010-10-11 11:56:40.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/smp.h	2010-10-11 12:00:22.000000000 -0700
@@ -50,7 +50,7 @@ struct smp_ops {
 	void (*smp_prepare_cpus)(unsigned max_cpus);
 	void (*smp_cpus_done)(unsigned max_cpus);
 
-	void (*smp_send_stop)(void);
+	void (*stop_other_cpus)(int wait);
 	void (*smp_send_reschedule)(int cpu);
 
 	int (*cpu_up)(unsigned cpu);
@@ -73,7 +73,12 @@ extern struct smp_ops smp_ops;
 
 static inline void smp_send_stop(void)
 {
-	smp_ops.smp_send_stop();
+	smp_ops.stop_other_cpus(0);
+}
+
+static inline void stop_other_cpus(void)
+{
+	smp_ops.stop_other_cpus(1);
 }
 
 static inline void smp_prepare_boot_cpu(void)
Index: linux-2.6/arch/x86/kernel/reboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/reboot.c	2010-10-11 11:56:40.000000000 -0700
+++ linux-2.6/arch/x86/kernel/reboot.c	2010-10-11 12:05:02.000000000 -0700
@@ -641,7 +641,7 @@ void native_machine_shutdown(void)
 	/* O.K Now that I'm on the appropriate processor,
 	 * stop all of the others.
 	 */
-	smp_send_stop();
+	stop_other_cpus();
 #endif
 
 	lapic_shutdown();
Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c	2010-10-11 11:56:40.000000000 -0700
+++ linux-2.6/arch/x86/kernel/smp.c	2010-10-11 12:16:42.000000000 -0700
@@ -159,10 +159,10 @@ asmlinkage void smp_reboot_interrupt(voi
 	irq_exit();
 }
 
-static void native_smp_send_stop(void)
+static void native_stop_other_cpus(int wait)
 {
 	unsigned long flags;
-	unsigned long wait;
+	unsigned long timeout;
 
 	if (reboot_force)
 		return;
@@ -179,9 +179,12 @@ static void native_smp_send_stop(void)
 	if (num_online_cpus() > 1) {
 		apic->send_IPI_allbutself(REBOOT_VECTOR);
 
-		/* Don't wait longer than a second */
-		wait = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && wait--)
+		/*
+		 * Don't wait longer than a second if the caller
+		 * didn't ask us to wait.
+		 */
+		timeout = USEC_PER_SEC;
+		while (num_online_cpus() > 1 && (wait || timeout--))
 			udelay(1);
 	}
 
@@ -227,7 +230,7 @@ struct smp_ops smp_ops = {
 	.smp_prepare_cpus	= native_smp_prepare_cpus,
 	.smp_cpus_done		= native_smp_cpus_done,
 
-	.smp_send_stop		= native_smp_send_stop,
+	.stop_other_cpus	= native_stop_other_cpus,
 	.smp_send_reschedule	= native_smp_send_reschedule,
 
 	.cpu_up			= native_cpu_up,
Index: linux-2.6/arch/x86/xen/enlighten.c
===================================================================
--- linux-2.6.orig/arch/x86/xen/enlighten.c	2010-10-11 11:56:40.000000000 -0700
+++ linux-2.6/arch/x86/xen/enlighten.c	2010-10-11 11:58:53.000000000 -0700
@@ -1018,7 +1018,7 @@ static void xen_reboot(int reason)
 	struct sched_shutdown r = { .reason = reason };
 
 #ifdef CONFIG_SMP
-	smp_send_stop();
+	stop_other_cpus();
 #endif
 
 	if (HYPERVISOR_sched_op(SCHEDOP_shutdown, &r))
Index: linux-2.6/arch/x86/xen/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/xen/smp.c	2010-10-11 11:56:40.000000000 -0700
+++ linux-2.6/arch/x86/xen/smp.c	2010-10-11 11:58:53.000000000 -0700
@@ -400,9 +400,9 @@ static void stop_self(void *v)
 	BUG();
 }
 
-static void xen_smp_send_stop(void)
+static void xen_stop_other_cpus(int wait)
 {
-	smp_call_function(stop_self, NULL, 0);
+	smp_call_function(stop_self, NULL, wait);
 }
 
 static void xen_smp_send_reschedule(int cpu)
@@ -470,7 +470,7 @@ static const struct smp_ops xen_smp_ops 
 	.cpu_disable = xen_cpu_disable,
 	.play_dead = xen_play_dead,
 
-	.smp_send_stop = xen_smp_send_stop,
+	.stop_other_cpus = xen_stop_other_cpus,
 	.smp_send_reschedule = xen_smp_send_reschedule,
 
 	.send_call_func_ipi = xen_smp_send_call_function_ipi,
Index: linux-2.6/include/linux/smp.h
===================================================================
--- linux-2.6.orig/include/linux/smp.h	2010-10-11 11:56:40.000000000 -0700
+++ linux-2.6/include/linux/smp.h	2010-10-11 12:07:19.000000000 -0700
@@ -116,7 +116,8 @@ extern unsigned int setup_max_cpus;
 
 #else /* !SMP */
 
-static inline void smp_send_stop(void) { }
+static inline void smp_send_stop(void) { }
+static inline void stop_other_cpus(void) { }
 
 /*
  *	These macros fold the SMP functionality into a single CPU system






