[LEDE-DEV] Procd gets SIGILL during 'early' state on glibc builds
Bogdan Harjoc
harjoc at gmail.com
Tue Aug 22 06:47:33 PDT 2017
On a glibc build, procd sometimes terminates with SIGILL, and since
it's pid 1, a kernel panic follows. This happens with HEAD of
source.git, procd, libubox and ubus, as well as with older (2014)
builds. The platform is a Cortex A9 SoC with kernels 3.2 and 3.19
available:
[ 6.356037] mount_root: mounting /dev/root
[ 6.377581] procd: - early -
[ 6.563175] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
If I fork a "gdb -attach 1" from init before it execs into procd, I
can get a backtrace:
#0 0x76f09962 in uloop_setup_signals (add=<optimized out>)
at /home/bogdan/lede/build_dir/target-arm_cortex-a9_glibc-2.21_eabi/libubox-2016-02-263
#1 0x0000000a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) disas
Dump of assembler code for function uloop_setup_signals:
...
0x76f09934 <+48>: add r2, r4, #144 ; 0x90
0x76f09938 <+52>: mov r3, r5
0x76f0993c <+56>: mov r1, r6
0x76f09940 <+60>: mov r0, #15
0x76f09944 <+64>: bl 0x76f09874 <uloop_install_handler>
0x76f09948 <+68>: ldr r1, [pc, #96] ; 0x76f099b0 <uloop_setup_signals+172>
0x76f0994c <+72>: add r2, r4, #284 ; 0x11c
0x76f09950 <+76>: mov r3, r5
0x76f09954 <+80>: mov r0, #17
0x76f09958 <+84>: add r4, sp, #4
0x76f0995c <+88>: add r1, pc, r1
0x76f09960 <+92>: bl 0x76f09874 <uloop_install_handler>
0x76f09964 <+96>: mov r2, r4
0x76f09968 <+100>: mov r1, #0
0x76f0996c <+104>: mov r0, #13
0x76f09970 <+108>: bl 0x76f07dcc <sigaction at plt>
0x76f09974 <+112>: cmp r5, #0
...
The 0x76f09962 address from the backtrace falls in the middle of the
'bl' opcode at 0x76f09960. Since uloop_setup_signals is called
directly via main -> uloop_run, is it possible that the pc printed by
gdb for the SIGILL is not right?
I used __cyg_profile_func_enter/exit to record all function calls and
exits in a circular queue; dumping this queue from gdb, it looks like
the stack should be different. When the SIGILL arrives, the call queue
usually ends with:
json_process_expr
__json_process_type
handle_expr_regex
expr_eq_regex
json_get_tuple
...
msg_find_var
...
blobmsg_type
blobmsg_data
eq_regex_cmp
...and possibly some libc call here which __cyg_... doesn't record
(the indent level shows call depth, but all calls at each level are
recorded as well, since it's a queue, not just a stack).
The crash happens about once in 20 reboots. Things that prevent it
from happening:
- building userspace with uclibc instead of glibc
- kernel 3.2 instead of 3.19
- running init under valgrind
I've compiled init and its libraries with -fsanitize=address and with
libssp, but these didn't turn up anything.
My first question is: besides jumping into invalid ARM code, what
other reasons are there for receiving a SIGILL? Can it be caused by
accessing invalid memory, executing privileged opcodes, or some other
illegal action the process is taking?
The second question is: under what conditions could the kernel send a
SIGILL with an unrelated pc value?
Thanks,
Bogdan