[PATCH v4 4/7] arm64: Handle early CPU boot failures
Mark Rutland
mark.rutland at arm.com
Wed Feb 3 09:01:15 PST 2016
On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K Poulose wrote:
> From: Suzuki K. Poulose <suzuki.poulose at arm.com>
>
> A secondary CPU could fail to come online due to insufficient
> capabilities and could simply die or loop in the kernel.
> e.g, a CPU with no support for the selected kernel PAGE_SIZE
> loops in kernel with MMU turned off.
> or a hotplugged CPU which doesn't have one of the advertised
> system capability will die during the activation.
>
> There is no way to synchronise the status of the failing CPU
> back to the master. This patch solves the issue by adding a
> field to the secondary_data which can be updated by the failing
> CPU. If the secondary CPU fails even before turning the MMU on,
> it updates the status in a special variable reserved in the head.txt
> section to make sure that the update can be cache invalidated safely
> without possible sharing of cache write back granule.
>
> Here are the possible states :
>
> -1. CPU_MMU_OFF - Initial value set by the master CPU, this value
> indicates that the CPU could not turn the MMU on, hence the status
> could not be reliably updated in the secondary_data. Instead, the
> CPU has updated the status in __early_cpu_boot_status (reserved in
> head.txt section)
>
> 0. CPU_BOOT_SUCCESS - CPU has booted successfully.
>
> 1. CPU_KILL_ME - CPU has invoked cpu_ops->die, indicating the
> master CPU to synchronise by issuing a cpu_ops->cpu_kill.
>
> 2. CPU_STUCK_IN_KERNEL - CPU couldn't invoke die(), instead is
> looping in the kernel. This information could be used by say,
> kexec to check if it is really safe to do a kexec reboot.
>
> 3. CPU_PANIC_KERNEL - CPU detected some serious issues which
> requires kernel to crash immediately. The secondary CPU cannot
> call panic() until it has initialised the GIC. This flag can
> be used to instruct the master to do so.
When would we use this last case?
Perhaps a better option is to always throw any incompatible CPU into an
(MMU-off) pen, and assume that it's stuck in the kernel, even if we
could theoretically turn it off.
A system with incompatible CPUs is a broken system to begin with...
Otherwise, comments below.
> Cc: Will Deacon <will.deacon at arm.com>
> Cc: Mark Rutland <mark.rutland at arm.com>
> Cc: Catalin Marinas <catalin.marinas at arm.com>
> Signed-off-by: Suzuki K. Poulose <suzuki.poulose at arm.com>
> ---
> arch/arm64/include/asm/smp.h | 26 ++++++++++++++++++++++
> arch/arm64/kernel/asm-offsets.c | 2 ++
> arch/arm64/kernel/head.S | 39 +++++++++++++++++++++++++++++---
> arch/arm64/kernel/smp.c | 47 +++++++++++++++++++++++++++++++++++++++
> 4 files changed, 111 insertions(+), 3 deletions(-)
[...]
> @@ -650,6 +679,10 @@ __enable_mmu:
> ENDPROC(__enable_mmu)
>
> __no_granule_support:
> + /* Indicate that this CPU can't boot and is stuck in the kernel */
> + update_early_cpu_boot_status x2, CPU_STUCK_IN_KERNEL
> +1:
> wfe
> - b __no_granule_support
> + wfi
> + b 1b
The addition of wfi seems fine, but should be mentioned in the commit
message.
> ENDPROC(__no_granule_support)
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 59f032b..d2721ae 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -63,6 +63,8 @@
> * where to place its SVC stack
> */
> struct secondary_data secondary_data;
> +/* Number of CPUs which aren't online, but looping in kernel text. */
> +u32 cpus_stuck_in_kernel;
Why u32 rather than int?
>
> enum ipi_msg_type {
> IPI_RESCHEDULE,
> @@ -72,6 +74,16 @@ enum ipi_msg_type {
> IPI_IRQ_WORK,
> };
>
> +#ifdef CONFIG_HOTPLUG_CPU
> +static int op_cpu_kill(unsigned int cpu);
> +#else
> +static inline int op_cpu_kill(unsigned int cpu)
> +{
> + return -ENOSYS;
> +}
> +#endif
There is no !CONFIG_HOTPLUG_CPU configuration any more.
> +
> +
> /*
> * Boot a secondary CPU, and assign it the specified idle task.
> * This also gives us the initial stack to use for this CPU.
> @@ -89,12 +101,14 @@ static DECLARE_COMPLETION(cpu_running);
> int __cpu_up(unsigned int cpu, struct task_struct *idle)
> {
> int ret;
> + int status;
>
> /*
> * We need to tell the secondary core where to find its stack and the
> * page tables.
> */
> secondary_data.stack = task_stack_page(idle) + THREAD_START_SP;
> + update_cpu_boot_status(CPU_MMU_OFF);
> __flush_dcache_area(&secondary_data, sizeof(secondary_data));
>
> /*
> @@ -117,7 +131,35 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
> pr_err("CPU%u: failed to boot: %d\n", cpu, ret);
> }
>
> + /* Make sure the update to status is visible */
> + smp_rmb();
> secondary_data.stack = NULL;
> + status = READ_ONCE(secondary_data.status);
What is the rmb intended to order here?
> + if (ret && status) {
> +
> + if (status == CPU_MMU_OFF)
> + status = READ_ONCE(__early_cpu_boot_status);
> +
> + switch (status) {
> + default:
> + pr_err("CPU%u: failed in unknown state : 0x%x\n",
> + cpu, status);
> + break;
> + case CPU_KILL_ME:
> + if (!op_cpu_kill(cpu)) {
> + pr_crit("CPU%u: died during early boot\n", cpu);
> + break;
> + }
> + /* Fall through */
> + pr_crit("CPU%u: may not have shut down cleanly\n", cpu);
> + case CPU_STUCK_IN_KERNEL:
> + pr_crit("CPU%u: is stuck in kernel\n", cpu);
> + cpus_stuck_in_kernel++;
> + break;
> + case CPU_PANIC_KERNEL:
> + panic("CPU%u detected unsupported configuration\n", cpu);
> + }
> + }
>
> return ret;
> }
> @@ -185,6 +227,9 @@ asmlinkage void secondary_start_kernel(void)
> */
> pr_info("CPU%u: Booted secondary processor [%08x]\n",
> cpu, read_cpuid_id());
> + update_cpu_boot_status(CPU_BOOT_SUCCESS);
> + /* Make sure the status update is visible before we complete */
> + smp_wmb();
Surely complete() has appropriate barriers?
> set_cpu_online(cpu, true);
> complete(&cpu_running);
>
> @@ -327,10 +372,12 @@ void cpu_die_early(void)
> set_cpu_present(cpu, 0);
>
> #ifdef CONFIG_HOTPLUG_CPU
> + update_cpu_boot_status(CPU_KILL_ME);
> /* Check if we can park ourselves */
> if (cpu_ops[cpu] && cpu_ops[cpu]->cpu_die)
> cpu_ops[cpu]->cpu_die(cpu);
I think you need a barrier to ensure visibility of the store prior to
calling cpu_die (i.e. you want to order an access against execution). A
dsb is what you want -- smp_wmb() only expands to a dmb.
> #endif
> + update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
>
> cpu_park_loop();
Likewise here.
Mark.
More information about the linux-arm-kernel
mailing list