Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing

Sudeep Holla sudeep.holla at arm.com
Mon Mar 16 12:16:05 PDT 2015



On 16/03/15 18:16, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 05:47:46PM +0000, Sudeep Holla wrote:
>> Hi Russell,
>>
>> I was able to see the exact same behaviour on my VExpress setup with the
>> CA9X4 core-tile. A few observations from my side:
>>
>> 1. This issue can be reproduced even on v3.19
>> 2. As you suspected L2C, I tried disabling L2C and it seems to solve
>>     the issue
>
> My L2C says its cache ID is 0x410000c3 - which is indeed an L2C-310, but
> with an undocumented revision ID of 3, which, as far as we can make out,
> is an R1Px where x > 0.
>
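
As a cross-check of the ID quoted above, here is a minimal sketch that splits the cache ID value into its fields. The field layout (implementer in bits [31:24], part number in bits [9:6], RTL release in bits [5:0]) and the release table are my reading of the L2C-310 documentation and the kernel's cache-l2x0.h, so treat both as assumptions. decode_l2c_cache_id is a made-up helper name, not a kernel function.

```python
# Assumed RTL release encodings for the L2C-310 (not taken from this thread).
KNOWN_L310_RTL = {
    0x0: "r0p0",
    0x2: "r1p0",
    0x4: "r2p0",
    0x5: "r3p0",
    0x6: "r3p1",
    0x8: "r3p2",
}


def decode_l2c_cache_id(cache_id):
    """Split a cache ID register value into its assumed fields."""
    rtl = cache_id & 0x3F                        # bits [5:0]: RTL release
    return {
        "implementer": (cache_id >> 24) & 0xFF,  # bits [31:24]
        "part": (cache_id >> 6) & 0xF,           # bits [9:6]: 3 means L2C-310
        "rtl": rtl,
        "release": KNOWN_L310_RTL.get(rtl, "undocumented"),
    }


info = decode_l2c_cache_id(0x410000C3)
print({key: hex(val) if isinstance(val, int) else val for key, val in info.items()})
```

Under those assumptions, 0x410000c3 gives implementer 0x41, part number 3 and RTL release 3. That release is not in the table and sits between 0x2 (r1p0) and 0x4 (r2p0), which fits the R1Px guess.
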
>> 3. Since it's very random and enabling LL_DEBUG made it difficult to
>>     reproduce the issue, I tried to dump the stack using the DS5 debugger
>> 4. The stack is always exactly the same, on both the v4.0-rc* and v3.19
>>     kernels and across multiple runs
>
> Hmm, I haven't seen them before I moved to 4.0-rc3 - before then my
> nightly boot tests (which run two boots on the platform each night)
> always seemed to succeed.
>
>> 5. Connecting the h/w debugger, then stopping and restarting the CPUs,
>>     solves the issue. It somehow helps the CPUs get out of
>>     __radix_tree_lookup
>
> Interesting.  Are the traces below from 4.0-rc3 or an older kernel?
>

This one is with v3.19, but I get the exact same trace with the v4.0-rc* kernels.

>> Stacktrace
>> ==========
>> (sorry, it looks different from a standard Linux backtrace as this one is
>> a dump from DS5)
>>
>> CPU 0
>> ----
>> #0 __radix_tree_lookup( root = <Value currently has no location>, index =
>> 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at
>> radix-tree.c:517
>
> Can you dump the disassembly around this location for both CPU0 and CPU1
> and the register values please?  I think it would be interesting to see
> if they're both stuck on exactly the same address access.
>

(with v4.0-rc4 this time)

CPU#0
=====
#0 __radix_tree_lookup( root = <Value currently has no location>, index = 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at radix-tree.c:517
     node = (struct radix_tree_node*) 0xBEC00001
     parent = <Value optimised away by compiler>
     height = 1
     shift = 0
     slot = <Value currently has no location>
#1 generic_handle_irq( irq = 16 ) at irqdesc.c:349
     desc = <Value optimised away by compiler>
#2 __handle_domain_irq( domain = (struct irq_domain*) 0xBF004400, hwirq = 16, lookup = <Value currently has no location>, regs = <Value currently has no location> ) at irqdesc.c:391
     old_regs = (struct pt_regs*) 0x0
     irq = <Value optimised away by compiler>
     ret = 0
#3 __raw_readl( addr = <Value optimised away by compiler> ) at io.h:121
#4 gic_handle_irq( regs = (struct pt_regs*) 0x805F1F40 ) at irq-gic.c:277
     irqstat = 2147518036
     irqnr = <Value currently has no location>
     gic = <Value optimised away by compiler>
     cpu_base = (void*) 0xC0802100
#5 [__irq_svc+0x40]

Disassembly:

S:0x8021F80C : LSL      lr,r4,#3
S:0x8021F810 : SUB      lr,lr,r4,LSL #1
S:0x8021F814 : SUB      lr,lr,#6
S:0x8021F818 : B        {pc}+8 ; 0x8021f820
S:0x8021F81C : MOV      r5,r0
S:0x8021F820 : LSR      r12,r1,lr
S:0x8021F824 : SUB      lr,lr,#6
S:0x8021F828 : AND      r12,r12,#0x3f
S:0x8021F82C : ADD      r12,r12,#6
S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]

Core registers:
R0           0x0000003F
R1           0x00000010
R2           0x00000000
R3           0x00000000
R4           0x00000001
R5           0xBEC00000
R6           0x00000000
R7           0x00000000
R8           0xBF004400
R9           0x805F1F90
R10          0x00000001
R11          0x805EEB08
R12          0xBEC00001
SP           0x805F1EFC
LR           0x00000000
PC           0x8021F820
CPSR         0x80000193
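
The CPSR value for CPU#0 has no textual decode next to it, unlike the CPU#1 value further down. A small sketch that decodes the condition flags, the A/I/F mask bits and the mode field is below. The bit positions are the standard ARMv7-A CPSR layout; decode_cpsr is just an illustrative helper name.

```python
# Mode field encodings (CPSR bits [4:0]) for the common ARMv7-A modes.
MODES = {0x10: "USR", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
         0x17: "ABT", 0x1B: "UND", 0x1F: "SYS"}


def _letters(value, fields):
    """Upper-case letter if the bit is set, lower-case if it is clear."""
    return "".join(name if value & (1 << bit) else name.lower()
                   for name, bit in fields)


def decode_cpsr(cpsr):
    flags = _letters(cpsr, (("N", 31), ("Z", 30), ("C", 29), ("V", 28)))
    masks = _letters(cpsr, (("A", 8), ("I", 7), ("F", 6)))
    thumb = bool(cpsr & (1 << 5))
    mode = MODES.get(cpsr & 0x1F, "unknown")
    return flags, masks, thumb, mode


print(decode_cpsr(0x80000193))  # CPU#0
print(decode_cpsr(0x800001D3))  # CPU#1
```

Both values decode to SVC mode in ARM state with only N set. CPU#0 has A and I masked but F clear, while CPU#1 has A, I and F all masked, which agrees with the "AIF ... SVC" annotation DS5 prints for CPU#1.
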

CPU#1
=====
#0 __radix_tree_lookup( root = <Value currently has no location>, index = 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at radix-tree.c:517
     node = (struct radix_tree_node*) 0xBEC00001
     parent = <Value optimised away by compiler>
     height = 1
     shift = 0
     slot = <Value currently has no location>
#1 __irq_get_desc_lock( irq = <Value currently has no location>, flags = (long unsigned int*) 0xBF08BF94, bus = false, check = 3 ) at irqdesc.c:544
     desc = <Value optimised away by compiler>
#2 enable_percpu_irq( irq = 16, type = 0 ) at manage.c:1583
     cpu = 1
     flags = <Value currently has no location>
     desc = <Value optimised away by compiler>
#3 twd_timer_cpu_notify( self = <Value not available : Undefined value in stack frame for register R0>, action = <Value currently has no location>, hcpu = <Value not available : Undefined value in stack frame for register R2> ) at smp_twd.c:322
#4 notifier_call_chain( nl = <Value currently has no location>, val = <Value not available : Undefined value in stack frame for register R1>, v = <Value not available : Undefined value in stack frame for register R2>, nr_to_call = <Value not available : Undefined value in stack frame for register R3>, nr_calls = (int*) 0x0 ) at notifier.c:95
     ret = <Value currently has no location>
     nb = <Value optimised away by compiler>
     next_nb = <Value optimised away by compiler>
#5 notifier_to_errno( ret = <Value currently has no location> ) at notifier.h:179
#6 cpu_notify( val = <Value currently has no location>, v = <Value currently has no location> ) at cpu.c:234
#7 secondary_start_kernel() at smp.c:367
     mm = <Value optimised away by compiler>
     cpu = 1
#8 [S:0x60008724]

Disassembly:

S:0x8021F80C : LSL      lr,r4,#3
S:0x8021F810 : SUB      lr,lr,r4,LSL #1
S:0x8021F814 : SUB      lr,lr,#6
S:0x8021F818 : B        {pc}+8 ; 0x8021f820
S:0x8021F81C : MOV      r5,r0
S:0x8021F820 : LSR      r12,r1,lr
S:0x8021F824 : SUB      lr,lr,#6
S:0x8021F828 : AND      r12,r12,#0x3f
S:0x8021F82C : ADD      r12,r12,#6
S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]

Core registers:
R0           0x0000003F
R1           0x00000010
R2           0x00000000
R3           0x00000000
R4           0x00000001
R5           0xBEC00000
R6           0xBF08BF94
R7           0x00000000
R8           0x805F92A0
R9           0x00000000
R10          0x00000000
R11          0x00000000
R12          0xBEC00001
SP           0xBF08BF6C
LR           0x00000000
PC           0x8021F820
CPSR         0x800001D3   Nzcvq_ge3ge2ge1ge0_inactive_eAIFtj_SVC
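
On the question of whether both CPUs are stuck on the same address access: both are stopped at PC 0x8021F820 with identical R1, R5 and LR. The sketch below steps the four instructions from that PC up to the LDR at 0x8021F830 and computes the address it would load from. It assumes execution simply falls through from the reported PC with the register values shown; ldr_address is an illustrative helper, not anything from the kernel or DS5.

```python
MASK32 = 0xFFFFFFFF


def ldr_address(r1, r5, lr):
    """Emulate 0x8021F820..0x8021F830 and return the LDR address."""
    # LSR r12, r1, lr   (register shift uses the low byte of lr)
    shift = lr & 0xFF
    r12 = (r1 >> shift) & MASK32 if shift < 32 else 0
    # SUB lr, lr, #6    (updates lr only; the address does not depend on it)
    lr = (lr - 6) & MASK32
    # AND r12, r12, #0x3f
    r12 &= 0x3F
    # ADD r12, r12, #6
    r12 = (r12 + 6) & MASK32
    # LDR r0, [r5, r12, LSL #2]
    return (r5 + (r12 << 2)) & MASK32


cpu0 = ldr_address(r1=0x00000010, r5=0xBEC00000, lr=0x00000000)
cpu1 = ldr_address(r1=0x00000010, r5=0xBEC00000, lr=0x00000000)
print(hex(cpu0), hex(cpu1), cpu0 == cpu1)  # 0xbec00058 0xbec00058 True
```

Under those assumptions both CPUs would issue the load to 0xBEC00058, i.e. offset 0x18 + 4 * 16 from the node at 0xBEC00000 (the node pointer 0xBEC00001 with its low bit cleared).
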

[...]

> I'm beginning to believe at this point that there /is/ a bug in the L2C on
> the test chip, and that we're probably better off changing the Versatile
> Express DT files to disable the L2C cache controller... what are your
> thoughts on that?
>

I was thinking of taking a dump of the L2C register settings and comparing
them. But currently I am facing issues booting even v3.18 on my setup;
it seems to fail somewhere else, which I need to look at.

> I'm currently doing up to 8 boot tests - if I can do 8 consecutive boot
> tests which all succeed, I'm declaring it a pass, otherwise it's a fail.
> Generally, I've found that it will fail very early (like the first) but
> sometimes up to the 4th.
>
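
The pass/fail rule described above (up to 8 boots, pass only if all 8 succeed, stop at the first failure) could be scripted roughly as follows. This is only a sketch: boot_once is a hypothetical callable that power-cycles the board and returns True if the boot succeeded, and how it detects success (a login prompt on the serial console, a timeout, and so on) is left out.

```python
def run_boot_tests(boot_once, attempts=8):
    """Return (passed, boots_run); stop at the first failed boot."""
    for n in range(1, attempts + 1):
        if not boot_once():
            return False, n      # failed on the n-th boot
    return True, attempts        # every boot succeeded


# Example with canned results: the third boot hangs.
results = iter([True, True, False])
print(run_boot_tests(lambda: next(results)))  # (False, 3)
```

Returning the index of the failing boot keeps the "fails on the first, but sometimes as late as the 4th" information available for later comparison between kernels.
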
> I guess one thing we need to confirm is whether we have exactly the same
> hardware and firmware versions.  Here's my board's early boot messages:
>

ARM V2M Boot loader v1.1.2
HBI0190 build 2313

ARM V2M Firmware v3.1.2
Build Date: Apr 16 2013

Date: Mon 16 Mar 2015
Time:     18:57:21

Powering up system...
Daughterboard fitted to site 1.

Switching on ATXPSU...
ATX3V3: ON
VIOset: 1.8V
MBtemp: 26 degC

Configuring motherboard (rev D, var A)...
IOFPGA  config: PASSED
MUXFPGA config: PASSED
OSC CLK config: PASSED

Testing SMC devices (FPGA build 8)...
SRAM 32MB test: PASSED
VRAM  8MB test: PASSED
LAN9118   test: PASSED
USB & OTG test: PASSED
KMI1/KMI2 test: PASSED
MMC & SD  test: PASSED
DVI image test: PASSED
AACI AC97 test: PASSED
CF card   test: PASSED
UART port test: PASSED
MAC addrs test: PASSED

Reading Site 1 Board File \SITE1\HBI0191B\board.txt
DB1 JTAG configuration complete.
Setting DB1 OSCCLKS...
DB1.0 DCC 0 SPI configuration complete.

Writing SCC 0x40610000 with 0xBB8A802A
Writing SCC 0x40610001 with 0x00001F09
Writing SCC 0x40610002 with 0x00000000
DB1.0 DCC 0 SCC configuration complete.

DB SMB clock enabled.
Waiting for SITE1 CB_READY...
Testing SMB clock...
Configuring MUXFPGA for MB.
Setting DVI mode for VGA.
Releasing Daughterboard resets.
Switching MCC log to UART1.
%BootMonitor-Warning, Unable to open SYSTEM.DAT


ARM Versatile Express Boot Monitor
Version:    V5.2.1
Build Date: Apr  4 2013
Daughterboard Site 1: V2P-CA9 Cortex A9
Daughterboard Site 2: Not Used
Running boot script from flash - BOOTSCRIPT




