Occasional random segfault on Cortex-A15 when exiting SIGCHLD handler.

Tue Mar 18 21:32:51 EDT 2014

I have been trying to track down the cause of some random segfaults for
the last couple of weeks.  They were mostly showing up in one particular
program (confdc which is part of confd from tail-f, but is essentially
just the erlang 14 VM with some erlang modules running and the segfault
is happening in the main erlang code).  I have also seen occasional
segfaults in gcc and even gdb.

The system is a dra7xx-evm from TI (so an eval board for
the next OMAP5 chip), running a 3.8.13 based kernel from
git://git.omapzoom.org/kernel/omap.git (they are working on mainlining
the patches, but that's not done yet as far as I can tell) branch
p-ti-linux-3.8.y

The CPU has dual Cortex-A15 at 1.5GHz.  Core revision is r2p2.

The userspace is Debian Wheezy armhf.  Running the exact same code
(except the kernel) on a cubox never segfaults.  I initially wondered if
the hardware had a problem, but the failures seem too consistent to be
a random hardware failure, and nothing else ever seems to get corrupted.

The gdb backtraces from the failures so far all show a segfault on exit
from the SIGCHLD handler.  Since it happens in several different
programs, that seems unlikely to be a coincidence.

Some example traces:

gdb dying after it had already run a few thousand times with the same
arguments without failing:

(wheezydev)root@omap5:~/rmf# gdb `which gdb` ./core
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/gdb...done.
[New LWP 11964]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `gdb /usr/lib/confd/bin/confd mibs/core'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000 in ?? ()
(gdb) where
#0  0x00000000 in ?? ()
#1  0x00095a1e in sigchld_handler (signo=<error reading variable: Cannot access memory at address 0xb6bddc34>) at /root/gdb-7.4.1+dfsg/gdb/linux-nat.c:5530
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

The code in question there is:

5511 /* SIGCHLD handler that serves two purposes: In non-stop/async mode,
5512    so we notice when any child changes state, and notify the
5513    event-loop; it allows us to use sigsuspend in linux_nat_wait_1
5514    above to wait for the arrival of a SIGCHLD.  */
5515
5516 static void
5517 sigchld_handler (int signo)
5518 {
5519   int old_errno = errno;
5520
5521   if (debug_linux_nat)
5522     ui_file_write_async_safe (gdb_stdlog,
5523                               "sigchld\n", sizeof ("sigchld\n") - 1);
5524
5525   if (signo == SIGCHLD
5526       && linux_nat_event_pipe[0] != -1)
5527     async_file_mark (); /* Let the event loop know that there are
5528                            events to handle.  */
5529
5530   errno = old_errno;
5531 }

And from the erlang/confd case:

(wheezydev)root@omap5:~/rmf# gdb /usr/lib/confd/bin/confd mibs/core
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/lib/confd/bin/confd...done.
[New LWP 29581]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/usr/lib/confd/bin/confd -B -- -root /usr/lib/confd -progname confd -- -home /r'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000 in ?? ()
(gdb) where
#0  0x00000000 in ?? ()
#1  0x000b95ea in onchld (signum=<optimized out>) at sys/unix/sys.c:1150
#2  <signal handler called>
#3  0x000efd4c in __udivsi3 ()
#4  0x000efdf0 in __aeabi_uidivmod ()
#5  0x000efdf0 in __aeabi_uidivmod ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The code there is:
1138 #if (defined(SIG_SIGSET) || defined(SIG_SIGNAL))
1139 static RETSIGTYPE onchld(void)
1140 #else
1141 static RETSIGTYPE onchld(int signum)
1142 #endif
1143 {
1144 #if CHLDWTHR
1145     ASSERT(0); /* We should *never* catch a SIGCHLD signal */
1146 #elif defined(ERTS_SMP)
1147     smp_sig_notify('C');
1148 #else
1149     children_died = 1;
1150     ERTS_CHK_IO_INTR(1); /* Make sure we don't sleep in poll */
1151 #endif
1152 }

CHLDWTHR and ERTS_SMP are NOT defined/true in this case, so the #else
branch (lines 1149 and 1150) is the one in use.

I have tried to use strace on the process, and every call is exactly
the same between a working run and a segfaulting run, up until SIGCHLD
is received, and then on some runs it segfaults without making any
syscalls first.

I have tried slowing the CPU from 1.5GHz to 1.0GHz, which made no difference.

I tried running with maxcpus=1, which also made no difference.

I tried upgrading libc from 2.13 (Debian Wheezy) to 2.18 (Debian Jessie),
along with libgcc, and it made no difference either.

gdb seems to segfault maybe 1 in 1000 times.  confd segfaults more like
1 in 20 times so it is particularly good at triggering whatever the
problem is.

As I said, I have run the same code on a cubox (Marvell Armada 510, which
is a single core PJ4 at 800MHz as far as I recall), and tail-f has run
it on a pandaboard (dual Cortex-A9 as far as I know), and neither of
those has ever had any of these segfaults.  I have seen it on every one
of the dra7xx-evm boards we have.

Does anyone have a hint as to where to go next on tracking this down?

-- 
Len Sorensen