[RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading

Tue Nov 3 11:00:20 PST 2015

Hi,

On Tue, Nov 3, 2015 at 3:30 AM, Will Deacon <will.deacon at arm.com> wrote:
> On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
>> As the following log:
>> where we experience a CPU hard lockup. The assembly code (disassembled by gdb)
>>
>> 0xc06c6e90 <__tcp_select_window+148>:        beq     0xc06c6eb0 <__tcp_select_window+180>
>> 0xc06c6e94 <__tcp_select_window+152>:        mov     r2, #1008; 0x3f0
>> 0xc06c6e98 <__tcp_select_window+156>:        ldr     r5, [r0,#1004] ; 0x3ec
>> 0xc06c6e9c <__tcp_select_window+160>:        ldrh    r2, [r0,r2]
>> ....
>>
>> 0xc06c6ee0 <__tcp_select_window+228>:        addne   r0, r0, #1
>> 0xc06c6ee4 <__tcp_select_window+232>:        lslne   r0, r0, r2
>> 0xc06c6ee8 <__tcp_select_window+236>:        ldmne   sp, {r4, r5, r11, sp, pc}
>>
>> Could either the "strhi"/"strlo" pair, or the lslne/ldmne pair, be
>> tripping over errata 818325, or a similar errata?
>
> No. One of the conditions for #818325 is:
>
>   The second instruction is an UNPREDICTABLE STR or STM (maximum two
>   registers in the list) with write-back and the write-back register is
>   in the list of stored registers.
>
> I don't see either of those in your code snippet above, but then I don't
> see your strhi/strlo either. What's going on?

It looks like Caesar is proposing that this erratum is the root cause
for some hard lockups we're seeing on rk3288 Chromebooks.  I agree
with folks here who say this isn't terribly likely, but I always like
to be proven wrong.  ;)

We've got code that samples / prints CPU_DBGPCSR at the time of a hard
lockup.  That register isn't 100% accurate about where a CPU is, but
it's better than nothing (technically there may be ways to actually
use the DBG registers to stop the remote CPU and maybe give more info,
but I digress).
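
A rough sketch of how a raw CPU_DBGPCSR sample might be turned into an
instruction address.  Everything here is an assumption rather than
something stated in this thread: the bit layout (bit 0 set means Thumb,
bits [1:0] == 0b00 means ARM) and the 8/4-byte offsets follow one
reading of the ARMv7 debug architecture, and whether any offset applies
at all depends on the implementation, so it is left as a parameter.
The name decode_dbgpcsr is made up for illustration.

```python
def decode_dbgpcsr(sample, apply_offset=True):
    """Return (state, address) for a raw 32-bit DBGPCSR sample.

    ASSUMPTION: bit[0] == 1 marks a Thumb sample, bits[1:0] == 0b00 an
    ARM sample; anything else is reported as unknown.
    ASSUMPTION: when apply_offset is True the sample is taken to be
    ahead of the instruction by 8 bytes (ARM) or 4 bytes (Thumb).
    """
    sample &= 0xFFFFFFFF
    if sample & 1:
        # Thumb state: address lives in bits [31:1].
        addr = sample & 0xFFFFFFFE
        offset = 4 if apply_offset else 0
        return ("thumb", (addr - offset) & 0xFFFFFFFF)
    if (sample & 3) == 0:
        # ARM state: address lives in bits [31:2].
        offset = 8 if apply_offset else 0
        return ("arm", (sample - offset) & 0xFFFFFFFF)
    # bits[1:0] == 0b10: not interpreted in this sketch.
    return ("unknown", None)
```

With no offset applied, an ARM-state sample passes through unchanged,
so a logged value such as c0117c8c would map to the same address.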

When CPUs are hard locked up, they are often found at:

<c0117c8c> v7_coherent_kern_range+0x58/0x74
  or
<c0118278> v7wbi_flush_user_tlb_range+0x30/0x38

That made me think that an erratum might be the root cause of our hard
lockups, since ARM errata often trigger in cache/TLB functions.  I
think Caesar dug up this old erratum fix in response to my suggestion.

If you know of any ARM errata that might trigger hard lockups like
this, I'd certainly be all ears.  It's also possible that we've got
something running at too low a voltage, or that we've got clock dividers
or cache timings programmed incorrectly somewhere.  Here is a fuller
disassembly of one of the crashes:

  <4>[ 1623.480846] SMP: failed to stop secondary CPUs
  <3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88
  <3>[ 1623.480879] CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74
  <3>[ 1623.480895] CPU3 PC: <c0118268> v7wbi_flush_user_tlb_range+0x20/0x38
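
The symbol+offset/size notation in the lines above can be unpacked to
recover each function's start address, which makes it easier to match
a logged PC against a listing.  A minimal sketch, assuming the log
format is exactly as shown here; the helper name parse_cpu_pcs is made
up for illustration.

```python
import re

# Matches e.g. "CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74"
PC_LINE = re.compile(
    r"CPU(?P<cpu>\d+) PC: <(?P<pc>[0-9a-f]+)> "
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

def parse_cpu_pcs(log):
    """Map CPU number -> dict with pc, symbol, and function start/end."""
    result = {}
    for m in PC_LINE.finditer(log):
        pc = int(m.group("pc"), 16)
        start = pc - int(m.group("off"), 16)
        result[int(m.group("cpu"))] = {
            "pc": pc,
            "symbol": m.group("sym"),
            "start": start,                           # pc minus offset
            "end": start + int(m.group("size"), 16),  # one past the function
        }
    return result
```

For the CPU2 line this gives a start address of 0xc0117c34 for
v7_coherent_kern_range.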

---

c01827dc:       e2841010        add     r1, r4, #16
c01827e0:       e2445004        sub     r5, r4, #4
c01827e4:       eb068d33        bl      c0325cb8 <plist_del> (File Offset: 0x235cb8)
=> c01827e8:    f595f000        pldw    [r5]
c01827ec:       e1953f9f        ldrex   r3, [r5]
c01827f0:       e2433001        sub     r3, r3, #1
c01827f4:       e1852f93        strex   r2, r3, [r5]
c01827f8:       e3320000        teq     r2, #0
c01827fc:       1afffffa        bne     c01827ec <__unqueue_futex+0x6c> (File Offset: 0x927ec)
c0182800:       e89da830        ldm     sp, {r4, r5, fp, sp, pc}

---

c0117c80:       e08cc002        add     ip, ip, r2
c0117c84:       e15c0001        cmp     ip, r1
c0117c88:       3afffffb        bcc     c0117c7c <v7_coherent_kern_range+0x48> (File Offset: 0x27c7c)
=> c0117c8c:    e3a00000        mov     r0, #0
c0117c90:       ee070fd1        mcr     15, 0, r0, cr7, cr1, {6}
c0117c94:       f57ff04a        dsb     ishst
c0117c98:       f57ff06f        isb     sy
c0117c9c:       e1a0f00e        mov     pc, lr

---

c0118260:       e1830600        orr     r0, r3, r0, lsl #12
c0118264:       e1a01601        lsl     r1, r1, #12
=> c0118268:    ee080f33        mcr     15, 0, r0, cr8, cr3, {1}
c011826c:       e2800a01        add     r0, r0, #4096   ; 0x1000
c0118270:       e1500001        cmp     r0, r1
c0118274:       3afffffb        bcc     c0118268 <v7wbi_flush_user_tlb_range+0x20> (File Offset: 0x28268)
c0118278:       f57ff04b        dsb     ish
c011827c:       e1a0f00e        mov     pc, lr
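
The two mcr encodings in the listings above (ee070fd1 and ee080f33)
can be decoded by hand as a cross-check on which CP15 operations the
CPUs were at or near.  A small sketch, assuming the standard A32
MCR/MRC field layout.  The operation names in the table are taken from
the ARMv7-A CP15 register summary as understood here, not from this
thread, and the function names are illustrative.

```python
# ASSUMPTION: names for (CRn, opc1, CRm, opc2) on p15, per the ARMv7-A
# CP15 summary; only the two operations seen in the listings are listed.
CP15_NAMES = {
    (7, 0, 1, 6): "BPIALLIS",   # invalidate branch predictor, inner shareable
    (8, 0, 3, 1): "TLBIMVAIS",  # invalidate TLB entry by MVA, inner shareable
}

def decode_mcr_mrc(word):
    """Split a 32-bit A32 MCR/MRC encoding into its fields."""
    if ((word >> 24) & 0xF) != 0b1110 or ((word >> 4) & 1) != 1:
        raise ValueError("not an MCR/MRC encoding: %08x" % word)
    return {
        "cond": (word >> 28) & 0xF,
        "opc1": (word >> 21) & 0x7,
        "is_mrc": bool((word >> 20) & 1),  # 1 = MRC (read), 0 = MCR (write)
        "crn": (word >> 16) & 0xF,
        "rt": (word >> 12) & 0xF,
        "coproc": (word >> 8) & 0xF,
        "opc2": (word >> 5) & 0x7,
        "crm": word & 0xF,
    }

def describe(word):
    """Render the encoding in 'mcr p15, ...' form, with a name if known."""
    f = decode_mcr_mrc(word)
    text = "%s p%d, %d, r%d, c%d, c%d, %d" % (
        "mrc" if f["is_mrc"] else "mcr",
        f["coproc"], f["opc1"], f["rt"], f["crn"], f["crm"], f["opc2"],
    )
    name = None
    if f["coproc"] == 15:
        name = CP15_NAMES.get((f["crn"], f["opc1"], f["crm"], f["opc2"]))
    return text + (" (%s)" % name if name else "")
```

Under those assumptions, describe(0xee070fd1) yields
"mcr p15, 0, r0, c7, c1, 6 (BPIALLIS)" and describe(0xee080f33) yields
"mcr p15, 0, r0, c8, c3, 1 (TLBIMVAIS)", which matches the objdump
output above.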