[LEDE-DEV] Procd gets SIGILL during 'early' state on glibc builds
Bogdan Harjoc
harjoc at gmail.com
Tue Aug 22 06:47:33 PDT 2017
On a glibc build, procd sometimes terminates with SIGILL, and since
it's pid 1, a kernel panic follows. This happens with HEAD of
source.git, procd, libubox and ubus, as well as with older (2014)
builds. The platform is a Cortex A9 SoC with kernels 3.2 and 3.19
available:
[ 6.356037] mount_root: mounting /dev/root
[ 6.377581] procd: - early -
[ 6.563175] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
If I fork a "gdb -attach 1" from init before it execs into procd, I
can get a backtrace:
#0 0x76f09962 in uloop_setup_signals (add=<optimized out>)
at /home/bogdan/lede/build_dir/target-arm_cortex-a9_glibc-2.21_eabi/libubox-2016-02-263
#1 0x0000000a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) disas
Dump of assembler code for function uloop_setup_signals:
...
0x76f09934 <+48>: add r2, r4, #144 ; 0x90
0x76f09938 <+52>: mov r3, r5
0x76f0993c <+56>: mov r1, r6
0x76f09940 <+60>: mov r0, #15
0x76f09944 <+64>: bl 0x76f09874 <uloop_install_handler>
0x76f09948 <+68>: ldr r1, [pc, #96] ; 0x76f099b0 <uloop_setup_signals+172>
0x76f0994c <+72>: add r2, r4, #284 ; 0x11c
0x76f09950 <+76>: mov r3, r5
0x76f09954 <+80>: mov r0, #17
0x76f09958 <+84>: add r4, sp, #4
0x76f0995c <+88>: add r1, pc, r1
0x76f09960 <+92>: bl 0x76f09874 <uloop_install_handler>
0x76f09964 <+96>: mov r2, r4
0x76f09968 <+100>: mov r1, #0
0x76f0996c <+104>: mov r0, #13
0x76f09970 <+108>: bl 0x76f07dcc <sigaction at plt>
0x76f09974 <+112>: cmp r5, #0
...
The 0x76f09962 address from the backtrace falls in the middle of the
'bl' opcode at 0x76f09960. Since uloop_setup_signals is called
directly via main -> uloop_run, is it possible that the pc printed by
gdb for the SIGILL is not right?
I used __cyg_profile_func_enter/exit to record all function calls and
exits in a circular queue; dumping this queue from gdb, it looks like
the stack should be different. When the SIGILL arrives, the call queue
usually ends with:
json_process_expr
__json_process_type
handle_expr_regex
expr_eq_regex
json_get_tuple
...
msg_find_var
...
blobmsg_type
blobmsg_data
eq_regex_cmp
...and possibly some libc call here which __cyg_... doesn't record
(the indent level shows call depth, but all calls at each level are
recorded as well, since it's a queue, not just a stack).
The crash happens about once in 20 reboots. Things that prevent it
from happening:
- building userspace with uclibc instead of glibc
- kernel 3.2 instead of 3.19
- running init under valgrind
I've compiled init and its libraries with -fsanitize=address and with
libssp, but these didn't turn up anything.
My first question is: besides jumping into invalid ARM code, what
other reasons are there for receiving a SIGILL? Can it be caused by
accessing invalid memory, executing privileged opcodes, or some other
illegal action the process is taking?
The second question is: under what conditions could the kernel send a
SIGILL with an unrelated pc value?
Thanks,
Bogdan