Linux fails to start secondary cores when system resumes from Suspend-to-RAM

Mason slash.tmp at free.fr
Thu Dec 29 04:10:27 PST 2016


On 16/12/2016 08:25, Mason wrote:

> On 16/12/2016 06:14, Yu Chen wrote:
> 
>> On Thu, Dec 15, 2016 at 11:18 PM, Mason wrote:
>>
>>> I'm playing with suspend-to-RAM on the tango platform:
>>>
>>>   http://lxr.free-electrons.com/source/arch/arm/mach-tango/platsmp.c
>>>
>>> When the system is suspended, the CPU is completely powered down
>>> (receives no power whatsoever). When the system receives a wake-up
>>> event, the CPU is powered up, and starts up exactly the same way
>>> as for a cold boot (I think).
>>>
>>> However, while Linux successfully starts the secondary cores when
>>> the system first boots, it fails when the system resumes from "S3".
>>>
>>> I added printascii() calls inside secondary_start_kernel() and I can
>>> see that the following instructions are "properly" run:
>>>
>>>         cpu_switch_mm(mm->pgd, mm);
>>>         local_flush_bp_all();
>>>         enter_lazy_tlb(mm, current);
>>>
>>> but it seems local_flush_tlb_all(); never returns... :-(
>>>
>>>   http://lxr.free-electrons.com/source/arch/arm/include/asm/tlbflush.h#L332
>>>
>>>
>>> Looking more closely at that function, it seems to be failing in:
>>>
>>>         tlb_op(TLB_V7_UIS_FULL, "c8, c7, 0", zero);
>>>
>>> (meaning: I get a log before, but not after)
>>>
>>> On my system, tlb_op(TLB_V7_UIS_FULL, "c8, c7, 0", zero);
>>> resolves to:
>>>
>>> c010ce18:       e3170602        tst     r7, #2097152    ; 0x200000
>>> c010ce1c:       1e086f17        mcrne   15, 0, r6, cr8, cr7, {0}
>>>
>>> What could be happening?
>>> Can a core "hang" on this instruction?
>>> Can a core "crash" on this instruction (meaning, an exception
>>> is raised, and the core loops inside the exception code without
>>> Linux noticing... that seems unlikely)
>>
>> try online/offline the nonboot CPUs via
>> /sys/devices/system/cpu/cpuX/online
> 
> offline + online secondary core works.
> 
> Note: all cores are in the same power domain, so even if all
> secondary cores are offline, the CPU block remains powered up
> (secondary cores are just held in reset, or spinning in WFI,
> depending on the firmware version).
> 
> When the system is suspended, the CPU block (as well as 99%
> of the system) is powered down. Thus, upon resume, all cores
> will run the boot sequence (again).
> 
> I'm guessing that something goes wrong during this second
> boot sequence. Could there be a race condition between the
> primary core and one of the secondary cores?
> 
> What is different in the Linux boot sequence between cold
> boot and resume? I'm thinking that the state stored in RAM
> is in fact incompatible with what Linux expects when it resumes...

I've taken a closer look at the MCR instruction.

    tlb_op(TLB_V7_UIS_FULL, "c8, c7, 0", zero);

#define tlb_op(f, regs, arg)	__tlb_op(f, "p15, 0, %0, " regs, arg)

#define __tlb_op(f, insnarg, arg)					\
	do {								\
		if (always_tlb_flags & (f))				\
			asm("mcr " insnarg				\
			    : : "r" (arg) : "cc");			\
		else if (possible_tlb_flags & (f))			\
			asm("tst %1, %2\n\t"				\
			    "mcrne " insnarg				\
			    : : "r" (arg), "r" (__tlb_flag), "Ir" (f)	\
			    : "cc");					\
	} while (0)


c010dd64:       e3130c12        tst     r3, #4608       ; 0x1200
c010dd68:       1e081f17        mcrne   15, 0, r1, cr8, cr7, {0}
c010dd6c:       e3120602        tst     r2, #2097152    ; 0x200000
c010dd70:       1e081f17        mcrne   15, 0, r1, cr8, cr7, {0}
c010dd74:       f57ff047        dsb     un
c010dd78:       f57ff06f        isb     sy

A8.6.92 MCR, MCR2

Move to Coprocessor from ARM core register passes the value of an ARM core register to a coprocessor.
If no coprocessor can execute the instruction, an Undefined Instruction exception is generated.

This is a generic coprocessor instruction. Some of the fields have no functionality defined by the architecture
and are free for use by the coprocessor instruction set designer. These fields are the opc1, opc2, CRn, and
CRm fields.

Operation

if ConditionPassed() then
	EncodingSpecificOperations();
	if !Coproc_Accepted(cp, ThisInstr()) then
		GenerateCoprocessorException();
	else
		Coproc_SendOneWord(R[t], cp, ThisInstr());


I.7.5 Coproc_Accepted()

This function determines, for a coprocessor and one of its coprocessor instructions:
- Whether access to the coprocessor is permitted by the CPACR and, if the Security Extensions are
implemented, the NSACR.
- If access is permitted, whether the instruction is accepted by the coprocessor. The coprocessor
architecture definition specifies which instructions it accepts and in what circumstances.
It returns TRUE if access is permitted and the coprocessor accepts the instruction, and FALSE otherwise.

boolean Coproc_Accepted(integer cp_num, bits(32) instr)


I.7.17 GenerateCoprocessorException()

This procedure generates the appropriate exception for a rejected coprocessor instruction.
In all architecture variants and profiles described in this manual, GenerateCoprocessorException() generates
an Undefined Instruction exception.


Assuming I'm hitting this, I would expect the kernel to print something
if the secondary cores trigger an exception. Am I mistaken?


mov     r1, #0
mcrne   15, 0, r1, cr8, cr7, {0}

B3.12.34 CP15 c8, TLB maintenance operations
On ARMv7-A implementations, CP15 c8 operations are used for TLB maintenance functions. Figure B3-20
shows the CP15 c8 encodings.

TLBIALL*, invalidate unified TLB
* See text for more information about these mnemonics
Invalidate entire unified TLB
(When separate instruction and data TLBs are implemented,
these operations are performed on both TLBs.)
Rt data = Ignore

Invalidate entire TLB
The Invalidate entire TLB operations invalidate all unlocked entries in the TLB. The value in the register Rt
specified by the MCR instruction used to perform the operation is ignored. You do not have to write a value
to the register before issuing the MCR instruction.


Maybe this is a red herring... I don't see why the TLBIALL would
raise an exception when the system resumes... I first suspected
TrustZone shenanigans, but it looks like TLBIALL just flushes
what it can, so TrustZone seems to be a non-issue.

I'm stumped... Still looking for a clue :-(

Maybe I should instrument the exception handler...
if I only knew where it was.

Regards.



