[PATCH v2 21/21] arm64: Panic when VHE and non VHE CPUs coexist

Christoffer Dall christoffer.dall at linaro.org
Wed Feb 3 00:49:13 PST 2016


On Tue, Feb 02, 2016 at 03:32:04PM +0000, Marc Zyngier wrote:
> On 01/02/16 15:36, Christoffer Dall wrote:
> > On Mon, Jan 25, 2016 at 03:53:55PM +0000, Marc Zyngier wrote:
> >> Having both VHE and non-VHE capable CPUs in the same system
> >> is likely to be a recipe for disaster.
> >>
> > >> If the boot CPU has VHE, but a secondary does not, we won't be
> >> able to downgrade and run the kernel at EL1. Add CPU hotplug
> >> to the mix, and this produces a terrifying mess.
> >>
> >> Let's solve the problem once and for all. If you mix VHE and
> > >> non-VHE CPUs in the same system, you deserve to lose, and this
> >> patch makes sure you don't get a chance.
> >>
> >> This is implemented by storing the kernel execution level in
> >> a global variable. Secondaries will park themselves in a
> >> WFI loop if they observe a mismatch. Also, the primary CPU
> >> will detect that the secondary CPU has died on a mismatched
> >> execution level. Panic will follow.
> >>
> >> Signed-off-by: Marc Zyngier <marc.zyngier at arm.com>
> >> ---
> >>  arch/arm64/include/asm/virt.h | 17 +++++++++++++++++
> >>  arch/arm64/kernel/head.S      | 19 +++++++++++++++++++
> >>  arch/arm64/kernel/smp.c       |  3 +++
> >>  3 files changed, 39 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> >> index 9f22dd6..f81a345 100644
> >> --- a/arch/arm64/include/asm/virt.h
> >> +++ b/arch/arm64/include/asm/virt.h
> >> @@ -36,6 +36,11 @@
> >>   */
> >>  extern u32 __boot_cpu_mode[2];
> >>  
> >> +/*
> >> + * __run_cpu_mode records the mode the boot CPU uses for the kernel.
> >> + */
> >> +extern u32 __run_cpu_mode[2];
> >> +
> >>  void __hyp_set_vectors(phys_addr_t phys_vector_base);
> >>  phys_addr_t __hyp_get_vectors(void);
> >>  
> >> @@ -60,6 +65,18 @@ static inline bool is_kernel_in_hyp_mode(void)
> >>  	return el == CurrentEL_EL2;
> >>  }
> >>  
> >> +static inline bool is_kernel_mode_mismatched(void)
> >> +{
> >> +	/*
> >> +	 * A mismatched CPU will have written its own CurrentEL in
> >> +	 * __run_cpu_mode[1] (initially set to zero) after failing to
> >> +	 * match the value in __run_cpu_mode[0]. Thus, a non-zero
> >> +	 * value in __run_cpu_mode[1] is enough to detect the
> >> +	 * pathological case.
> >> +	 */
> >> +	return !!ACCESS_ONCE(__run_cpu_mode[1]);
> >> +}
> >> +
> >>  /* The section containing the hypervisor text */
> >>  extern char __hyp_text_start[];
> >>  extern char __hyp_text_end[];
> >> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> >> index 2a7134c..bc44cf8 100644
> >> --- a/arch/arm64/kernel/head.S
> >> +++ b/arch/arm64/kernel/head.S
> >> @@ -577,7 +577,23 @@ ENTRY(set_cpu_boot_mode_flag)
> >>  1:	str	w20, [x1]			// This CPU has booted in EL1
> >>  	dmb	sy
> >>  	dc	ivac, x1			// Invalidate potentially stale cache line
> >> +	adr_l	x1, __run_cpu_mode
> >> +	ldr	w0, [x1]
> >> +	mrs	x20, CurrentEL
> >> +	cbz	x0, skip_el_check
> >> +	cmp	x0, x20
> >> +	bne	mismatched_el
> > 
> > can't you do a ret here instead of writing the same value and flushing
> > caches etc.?
> 
> Yes, good point.
> 
> > 
> >> +skip_el_check:			// Only the first CPU gets to set the rule
> >> +	str	w20, [x1]
> >> +	dmb	sy
> >> +	dc	ivac, x1	// Invalidate potentially stale cache line
> >>  	ret
> >> +mismatched_el:
> >> +	str	w20, [x1, #4]
> >> +	dmb	sy
> >> +	dc	ivac, x1	// Invalidate potentially stale cache line
> >> +1:	wfi
> > 
> > I'm no expert on SMP bringup, but doesn't this prevent the CPU from
> > signaling completion and thus you'll never actually reach the checking
> > code in __cpu_up?
> 
> Indeed, and that's the whole point. The primary CPU will notice that the
> secondary CPU has failed to boot (timeout), and will find the reason in
> __run_cpu_mode.
> 
That wasn't exactly my point.  If I understand correctly, __cpu_up is
the function the primary CPU executes to bring up a secondary core.  It
waits for the cpu_running completion, which should be signalled by the
secondary core.  Because the parked secondary never makes any progress,
the wait for completion will time out, and you will see the
"..failed to come online" error instead of the "incompatible execution
level" one.

(This is based on my reading of the code, as the completion is signalled
in secondary_start_kernel, which happens after this stuff above in
head.S).
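
To make the concern concrete, here is a rough userspace model in C of the
two sides.  It is only a sketch under assumptions: check_el() mirrors the
head.S hunk above, while bring_up(), the cpu_up_result enum and the
report_mismatch_on_timeout flag are hypothetical names invented for the
illustration.  The smp.c hunk is not quoted in this mail, so this does not
claim to reflect what the real __cpu_up does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CurrentEL encodes the exception level in bits [3:2]. */
#define CURRENT_EL_EL1	(1u << 2)
#define CURRENT_EL_EL2	(2u << 2)

/*
 * Model of __run_cpu_mode from the patch:
 *   [0] = EL chosen by the first CPU (0 means "not set yet")
 *   [1] = EL written by a mismatched CPU (0 means "no mismatch")
 */
static uint32_t run_cpu_mode[2];

/* Hypothetical result codes, for illustration only. */
enum cpu_up_result {
	CPU_ONLINE,
	CPU_FAILED_TO_COME_ONLINE,
	CPU_EL_MISMATCH,
};

/*
 * Models the head.S hunk. Returns true if the CPU carries on booting,
 * false if it would park itself in the WFI loop.
 */
bool check_el(uint32_t current_el)
{
	assert(current_el != 0);	/* 0 is reserved for "unset" */

	if (run_cpu_mode[0] == 0) {
		/* Only the first CPU gets to set the rule. */
		run_cpu_mode[0] = current_el;
		return true;
	}
	if (run_cpu_mode[0] == current_el)
		return true;

	run_cpu_mode[1] = current_el;	/* record the mismatch, then park */
	return false;
}

bool is_kernel_mode_mismatched(void)
{
	return run_cpu_mode[1] != 0;
}

/*
 * Hypothetical model of the primary side. A parked secondary never
 * signals cpu_running, so the primary always lands on the timeout path;
 * the mismatch is only reported if that path looks at the flag.
 */
enum cpu_up_result bring_up(uint32_t secondary_el,
			    bool report_mismatch_on_timeout)
{
	bool signalled = check_el(secondary_el);

	if (signalled)
		return CPU_ONLINE;

	/* The wait for cpu_running has timed out at this point. */
	if (report_mismatch_on_timeout && is_kernel_mode_mismatched())
		return CPU_EL_MISMATCH;

	return CPU_FAILED_TO_COME_ONLINE;
}
```

With the boot CPU recorded at EL2, calling bring_up(CURRENT_EL_EL1, false)
in this model yields the generic failure even though run_cpu_mode[1] has
been set; only the variant that consults the flag on the timeout path
yields the mismatch result.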

-Christoffer


