[RFC PATCH] Bug during kexec...not all cpus are stopped

Eric W. Biederman ebiederm at xmission.com
Mon Oct 11 17:17:25 EDT 2010


Alok Kataria <akataria at vmware.com> writes:

> Hi Eric,
>
> Thanks for taking a look.
>
> On Mon, 2010-10-11 at 11:07 -0700, Eric W. Biederman wrote:
>> Alok Kataria <akataria at vmware.com> writes:
>> 
>> > Copying a few more maintainers on the thread...
>> > On Fri, 2010-10-08 at 13:34 -0700, Alok Kataria wrote:
>> >> Before starting the new kernel kexec calls machine_shutdown to stop all
>> >> the cpus, which internally calls native_smp_send_stop. AFAIU, kexec
>> >> expects that all the cpus are now halted after that call returns.
>> >> Now, looking at the code for native_smp_send_stop, it assumes that all
>> >> the processors have processed the REBOOT ipi in 1 second after the IPI
>> >> was sent.
>> >> native_smp_send_stop()
>> >> ---------------------------------------------------------
>> >>      apic->send_IPI_allbutself(REBOOT_VECTOR);
>> >>
>> >>         /* Don't wait longer than a second */
>> >>         wait = USEC_PER_SEC;
>> >>         while (num_online_cpus() > 1 && wait--)
>> >>              udelay(1);
>> >> ---------------------------------------------------------
>> >>
>> >> It just returns after that 1 second irrespective of whether all cpus
>> >> were halted or not. This brings up a issue in the kexec case, since we
>> >> can have the BSP starting the new kernel and AP's still processing the
>> >> REBOOT IPI simultaneously.
>> >>
>> >> Many distribution kernels use kexec to load the newly installed kernel
>> >> during the installation phase, in virtualized environment with the host
>> >> heavily overcommitted, we have seen some instances when vcpu fails to
>> >> process the IPI in the allotted 1 sec and as a result the AP's end up
>> >> accessing uninitialized state (the BSP has already gone ahead with
>> >> setting up the new state) and causing GPF's.
>> >>
>> >> IMO, kexec expects machine_shutdown to return only after all cpus are
>> >> stopped.
>> >>
>> >> The patch below should fix the issue, comments ??
>> >>
>> >> --
>> >> machine_shutdown now takes a parameter "wait", if it is true, it waits
>> >> until all the cpus are halted. All the callers except kexec still
>> >> fallback to the earlier version of the shutdown call, where it just
>> >> waited for max 1 sec before returning.
>> 
>> The approach seems reasonable.  However on every path except for the
>> panic path I would wait for all of the cpus to stop, as that is what
>> those code paths are expecting.  Until last year smp_send_stop always
>> waited until all of the cpus stopped.  So I think changing
>> machine_shutdown should not need to happen.
>> 
>> This patch cannot be used as is because it changes the parameters to
>> smp_send_stop() and there are more than just x86 implementations out
>> there.
>> 
>> Let me propose a slightly different tactic.  We need to separate
>> the panic path from the normal path to avoid confusion.  At the
>> generic level smp_send_stop is exclusively used for the panic
>> path.  So we should keep that, and rename the x86 non-panic path
>> function something else.
>> 
>> Can you rename the x86 smp_send_stop function say stop_all_other_cpus().
>> 
>> At which point we could implement smp_send_stop as:
>> stop_all_other_cpus(0);
>> 
>> And everywhere else would call stop_all_other_cpus(1) waiting for
>> the cpus to actually stop.
>
> Done, I have added a stop_other_cpus function which calls
> smp_ops.stop_other_cpus(1)
>
>> 
>> I really think we want to wait for the cpus to stop in all cases except
>> for panic/kdump where we expect things to be broken and we are doing
>> our best to make things work anyway.
> Now it does, except in the panic case.
>
> Jeremy, I have modified the xen_reboot bits too so that it now waits
> until all cpus are actually stopped, please take a look.
>
> --
>
> x86 smp_ops now has a new op, stop_other_cpus which takes a parameter "wait"
> this allows the caller to specify if it wants to stop untill all the cpus
> have processed the stop IPI. This is required specifically for the kexec case
> where we should wait for all the cpus to be stopped before starting the new
> kernel.
> We now wait for the cpus to stop in all cases except for panic/kdump where
> we expect things to be broken and we are doing our best to make things
> work anyway.

One little nit.


> Signed-off-by: Alok N Kataria <akataria at vmware.com>
> Cc: Eric W. Biederman <ebiederm at xmission.com>
> Cc: Jeremy Fitzhardinge <jeremy at xensource.com>
>
> Index: linux-2.6/arch/x86/include/asm/smp.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/smp.h	2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/smp.h	2010-10-11 12:00:22.000000000 -0700
> @@ -50,7 +50,7 @@ struct smp_ops {
>  	void (*smp_prepare_cpus)(unsigned max_cpus);
>  	void (*smp_cpus_done)(unsigned max_cpus);
>  
> -	void (*smp_send_stop)(void);
> +	void (*stop_other_cpus)(int wait);
>  	void (*smp_send_reschedule)(int cpu);
>  
>  	int (*cpu_up)(unsigned cpu);
> @@ -73,7 +73,12 @@ extern struct smp_ops smp_ops;
>  
>  static inline void smp_send_stop(void)
>  {
> -	smp_ops.smp_send_stop();
> +	smp_ops.stop_other_cpus(0);
> +}
> +
> +static inline void stop_other_cpus(void)
> +{
> +	smp_ops.stop_other_cpus(1);
>  }
>  
>  static inline void smp_prepare_boot_cpu(void)
> Index: linux-2.6/arch/x86/kernel/reboot.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/reboot.c	2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/reboot.c	2010-10-11 12:05:02.000000000 -0700
> @@ -641,7 +641,7 @@ void native_machine_shutdown(void)
>  	/* O.K Now that I'm on the appropriate processor,
>  	 * stop all of the others.
>  	 */
> -	smp_send_stop();
> +	stop_other_cpus();
>  #endif
>  
>  	lapic_shutdown();
> Index: linux-2.6/arch/x86/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/smp.c	2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/smp.c	2010-10-11 12:16:42.000000000 -0700
> @@ -159,10 +159,10 @@ asmlinkage void smp_reboot_interrupt(voi
>  	irq_exit();
>  }
>  
> -static void native_smp_send_stop(void)
> +static void native_stop_other_cpus(int wait)
>  {
>  	unsigned long flags;
> -	unsigned long wait;
> +	unsigned long timeout;
>  
>  	if (reboot_force)
>  		return;
> @@ -179,9 +179,12 @@ static void native_smp_send_stop(void)
>  	if (num_online_cpus() > 1) {
>  		apic->send_IPI_allbutself(REBOOT_VECTOR);
>  
> -		/* Don't wait longer than a second */
> -		wait = USEC_PER_SEC;
> -		while (num_online_cpus() > 1 && wait--)
> +		/*
> +		 * Don't wait longer than a second if the caller
> +		 * didn't ask us to wait.
> +		 */
> +		timeout = USEC_PER_SEC;
> +		while (num_online_cpus() > 1 && (wait || timeout--))
>  			udelay(1);
>  	}
>  
> @@ -227,7 +230,7 @@ struct smp_ops smp_ops = {
>  	.smp_prepare_cpus	= native_smp_prepare_cpus,
>  	.smp_cpus_done		= native_smp_cpus_done,
>  
> -	.smp_send_stop		= native_smp_send_stop,
> +	.stop_other_cpus	= native_stop_other_cpus,
>  	.smp_send_reschedule	= native_smp_send_reschedule,
>  
>  	.cpu_up			= native_cpu_up,
> Index: linux-2.6/arch/x86/xen/enlighten.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/xen/enlighten.c	2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/xen/enlighten.c	2010-10-11 11:58:53.000000000 -0700
> @@ -1018,7 +1018,7 @@ static void xen_reboot(int reason)
>  	struct sched_shutdown r = { .reason = reason };
>  
>  #ifdef CONFIG_SMP
> -	smp_send_stop();
> +	stop_other_cpus();
>  #endif
>  
>  	if (HYPERVISOR_sched_op(SCHEDOP_shutdown, &r))
> Index: linux-2.6/arch/x86/xen/smp.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/xen/smp.c	2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/xen/smp.c	2010-10-11 11:58:53.000000000 -0700
> @@ -400,9 +400,9 @@ static void stop_self(void *v)
>  	BUG();
>  }
>  
> -static void xen_smp_send_stop(void)
> +static void xen_stop_other_cpus(int wait)
>  {
> -	smp_call_function(stop_self, NULL, 0);
> +	smp_call_function(stop_self, NULL, wait);
>  }
>  
>  static void xen_smp_send_reschedule(int cpu)
> @@ -470,7 +470,7 @@ static const struct smp_ops xen_smp_ops 
>  	.cpu_disable = xen_cpu_disable,
>  	.play_dead = xen_play_dead,
>  
> -	.smp_send_stop = xen_smp_send_stop,
> +	.stop_other_cpus = xen_stop_other_cpus,
>  	.smp_send_reschedule = xen_smp_send_reschedule,
>  
>  	.send_call_func_ipi = xen_smp_send_call_function_ipi,
> Index: linux-2.6/include/linux/smp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/smp.h	2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/include/linux/smp.h	2010-10-11 12:07:19.000000000 -0700
> @@ -116,7 +116,8 @@ extern unsigned int setup_max_cpus;
>  
>  #else /* !SMP */
>  
> -static inline void smp_send_stop(void) { }
> +static inline void smp_send_stop() { }
> +static inline void stop_other_cpus() { }

You shouldn't change smp.h.

For the moment stop_other_cpus() is x86 specific so we shouldn't
impose it in other architectures. 

Also note removing the explicit void is a bad idea in C because that is
equivalent to saying: smp_send_stop(...) { }  Which allows any set
of parameters you can think of.

Eric




More information about the kexec mailing list