Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing

Russell King - ARM Linux linux at arm.linux.org.uk
Mon Mar 16 12:52:55 PDT 2015


On Mon, Mar 16, 2015 at 07:16:05PM +0000, Sudeep Holla wrote:
> On 16/03/15 18:16, Russell King - ARM Linux wrote:
> >Can you dump the disassembly around this location for both CPU0 and CPU1
> >and the register values please?  I think it would be interesting to see
> >if they're both stuck on exactly the same address access.
> 
> (with v4.0-rc4 this time)

Thanks.

> CPU#0
> =====
...
> S:0x8021F80C : LSL      lr,r4,#3
> S:0x8021F810 : SUB      lr,lr,r4,LSL #1
> S:0x8021F814 : SUB      lr,lr,#6
> S:0x8021F818 : B        {pc}+8 ; 0x8021f820
> S:0x8021F81C : MOV      r5,r0
> S:0x8021F820 : LSR      r12,r1,lr
> S:0x8021F824 : SUB      lr,lr,#6
> S:0x8021F828 : AND      r12,r12,#0x3f
> S:0x8021F82C : ADD      r12,r12,#6
> S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]
> 
> Core registers:
> R0           0x0000003F
> R1           0x00000010
> R2           0x00000000
> R3           0x00000000
> R4           0x00000001
> R5           0xBEC00000
> R6           0x00000000
> R7           0x00000000
> R8           0xBF004400
> R9           0x805F1F90
> R10          0x00000001
> R11          0x805EEB08
> R12          0xBEC00001
> SP           0x805F1EFC
> LR           0x00000000
> PC           0x8021F820
> CPSR         0x80000193
> 
> CPU#1
> =====
...
> S:0x8021F80C : LSL      lr,r4,#3
> S:0x8021F810 : SUB      lr,lr,r4,LSL #1
> S:0x8021F814 : SUB      lr,lr,#6
> S:0x8021F818 : B        {pc}+8 ; 0x8021f820
> S:0x8021F81C : MOV      r5,r0
> S:0x8021F820 : LSR      r12,r1,lr
> S:0x8021F824 : SUB      lr,lr,#6
> S:0x8021F828 : AND      r12,r12,#0x3f
> S:0x8021F82C : ADD      r12,r12,#6
> S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]
> 
> Core registers:
> R0           0x0000003F
> R1           0x00000010
> R2           0x00000000
> R3           0x00000000
> R4           0x00000001
> R5           0xBEC00000
> R6           0xBF08BF94
> R7           0x00000000
> R8           0x805F92A0
> R9           0x00000000
> R10          0x00000000
> R11          0x00000000
> R12          0xBEC00001
> SP           0xBF08BF6C
> LR           0x00000000
> PC           0x8021F820
> CPSR         0x800001D3   Nzcvq_ge3ge2ge1ge0_inactive_eAIFtj_SVC

And we find that both CPUs have stopped at exactly the same place, which
is an arithmetic instruction.
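
For reference, the two CPSR values in the dumps above can be decoded by hand.
A minimal sketch, assuming the standard ARMv7-A CPSR bit layout (mode in
bits [4:0], T/F/I/A in bits 5 to 8, N in bit 31); the helper name
decode_cpsr is purely illustrative and not part of any tool used here:

```python
# Assumption: ARMv7-A CPSR layout. Only the fields relevant here are decoded.
MODES = {0x10: "USR", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
         0x17: "ABT", 0x1B: "UND", 0x1F: "SYS"}

def decode_cpsr(cpsr):
    """Return the mode and mask bits of a 32-bit CPSR value."""
    return {
        "mode": MODES.get(cpsr & 0x1F, "?"),
        "thumb": bool(cpsr & (1 << 5)),         # T bit
        "fiq_masked": bool(cpsr & (1 << 6)),    # F bit
        "irq_masked": bool(cpsr & (1 << 7)),    # I bit
        "abort_masked": bool(cpsr & (1 << 8)),  # A bit
        "negative": bool(cpsr & (1 << 31)),     # N flag
    }

cpu0 = decode_cpsr(0x80000193)  # CPSR from the CPU#0 dump
cpu1 = decode_cpsr(0x800001D3)  # CPSR from the CPU#1 dump

for name, fields in (("CPU#0", cpu0), ("CPU#1", cpu1)):
    print(name, fields)
```

Under that layout both CPUs are in SVC mode, ARM state, with IRQs masked;
the only difference in these fields is the F bit, which is set on CPU#1
and clear on CPU#0.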

If I had to guess, I'd say the reason they've stopped there (exactly on a
cache line boundary) is that both CPUs are waiting for an instruction
fetch into their L1 I-cache to complete, and for some reason the L2
cache is not satisfying the request from either CPU.  The question of
course is... why not.
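
The cache line boundary claim is easy to check by hand. A small sketch,
assuming 32-byte L1 I-cache lines (typical of a Cortex-A9, but the line
size is not stated anywhere in this thread, so treat LINE_SIZE as an
assumption):

```python
LINE_SIZE = 32  # assumption: 32-byte I-cache lines; not confirmed in the thread

def line_offset(addr, line_size=LINE_SIZE):
    """Byte offset of addr within its cache line."""
    return addr % line_size

stalled_pc = 0x8021F820   # PC reported for both CPU#0 and CPU#1
branch_insn = 0x8021F818  # the B instruction that jumps to the stalled PC

print(hex(stalled_pc), "offset in line:", line_offset(stalled_pc))
print(hex(branch_insn), "offset in line:", line_offset(branch_insn))
# True when the branch target lies in a different line from the branch itself
print("new line needed:",
      stalled_pc // LINE_SIZE != branch_insn // LINE_SIZE)
```

With that line size, the stalled PC is the first word of a line, and the
branch that reaches it sits in the preceding line, so executing the LSR
does need a different line from the one holding the branch.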

> >I guess one thing we need to confirm is whether we have exactly the same
> >hardware and firmware versions.  Here's my board's early boot messages:

Looks like we're broadly the same, apart from the boot loader version.
You have 1.1.2, whereas I have 1.1.1.

Coincidentally, I just looked at the disassembly of my __radix_tree_lookup:

c0199750:       e0050495        mul     r5, r5, r4
c0199754:       e2455006        sub     r5, r5, #6
c0199758:       ea000000        b       c0199760 <__radix_tree_lookup+0x70>
c019975c:       e1a0c000        mov     ip, r0
c0199760:       e1a06531        lsr     r6, r1, r5
c0199764:       e206603f        and     r6, r6, #63     ; 0x3f
c0199768:       e2866006        add     r6, r6, #6
c019976c:       e79c0106        ldr     r0, [ip, r6, lsl #2]

The code is slightly different, but notice that the alignment of the
LSR instruction is the same as yours - at first I wondered whether that's
coincidence or not.  However, taking Olof's MMC changes back out of my
tree (which results in a booting kernel) makes no difference to the
placement of this code.
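
As for the code being "slightly different": the two sequences appear to
compute the same shift. A rough sketch, modelling the instructions from
both listings in Python; it assumes r5 holds 6 before the mul in my
build, which the excerpt above does not show, and the function names are
just for illustration:

```python
MASK32 = 0xFFFFFFFF  # model 32-bit register wrap-around

def shift_lsl_sub(r4):
    """Sequence at 0x8021F80C..0x8021F814 in the quoted listing."""
    lr = (r4 << 3) & MASK32           # LSL lr, r4, #3
    lr = (lr - (r4 << 1)) & MASK32    # SUB lr, lr, r4, LSL #1
    return (lr - 6) & MASK32          # SUB lr, lr, #6

def shift_mul(r4, r5=6):
    """Sequence at c0199750..c0199754; r5 == 6 on entry is an assumption."""
    r5 = (r5 * r4) & MASK32           # mul r5, r5, r4
    return (r5 - 6) & MASK32          # sub r5, r5, #6

def slot_byte_offset(index, shift):
    """LSR / AND #0x3f / ADD #6, then the LSL #2 scaling used by the LDR."""
    slot = ((index >> shift) & 0x3F) + 6
    return slot << 2

for r4 in range(1, 7):
    print(r4, shift_lsl_sub(r4), shift_mul(r4))

# Values from the register dumps: R4 = 1, R1 = 0x10
print(hex(slot_byte_offset(0x10, shift_lsl_sub(1))))
```

Both forms reduce to 6 * r4 - 6, so with R4 = 1 the shift is zero, which
matches LR = 0 in both register dumps.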

The start of the read-only data section doesn't change between the working
and non-working kernels, but the location of the spinlock and some scheduler
code does change (along with all the networking code.)

There are changes in the read-only data section, and there are also changes
to a set of "descriptor.NNNNN" symbols towards the end of the data section,
which go on to change the placement of the bss section.

The diff between the two System.map files is unpostable - it's about 1.3MB. :(

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


