Occasional random segfault on Cortex-A15 when exiting SIGCHLD handler.

Wed Mar 19 18:53:43 EDT 2014

On Tue, Mar 18, 2014 at 09:32:51PM -0400, Lennart Sorensen wrote:
> I have been trying to track down the cause of some random segfaults for
> the last couple of weeks.  They were mostly showing up in one particular
> program (confdc which is part of confd from tail-f, but is essentially
> just the erlang 14 VM with some erlang modules running and the segfault
> is happening in the main erlang code).  I have also seen occasional
> segfaults in gcc and even gdb.
> 
> The system is a dra7xx-evm from TI (so an eval board for
> the next OMAP5 chip), running a 3.8.13 based kernel from
> git://git.omapzoom.org/kernel/omap.git (they are working on mainlining
> the patches, but that's not done yet as far as I can tell) branch
> p-ti-linux-3.8.y
> 
> The CPU has dual Cortex-A15 at 1.5GHz.  Core revision is r2p2.
> 
> The userspace is Debian Wheezy armhf.  Running the exact same code
> (except the kernel) on a cubox never segfaults.  I initially wondered if
> the hardware had a problem, but the failures seem too consistent to be
> a random hardware failure, and nothing else ever seems to get corrupted.
> 
> The gdb backtraces from failures so far all indicated that there is a
> segfault when exiting the SIGCHLD handler, which given it is happening
> in multiple different programs seems like it can't just be a coincidence.
> 
> Some example traces:
> 
> gdb dying, after having been run a few thousand times with the same
> arguments without failing:
> 
> (wheezydev)root@omap5:~/rmf# gdb `which gdb` ./core
> GNU gdb (GDB) 7.4.1-debian
> Copyright (C) 2012 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "arm-linux-gnueabihf".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/bin/gdb...done.
> [New LWP 11964]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
> Core was generated by `gdb /usr/lib/confd/bin/confd mibs/core'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000 in ?? ()
> (gdb) where
> #0  0x00000000 in ?? ()
> #1  0x00095a1e in sigchld_handler (signo=<error reading variable: Cannot access memory at address 0xb6bddc34>) at /root/gdb-7.4.1+dfsg/gdb/linux-nat.c:5530
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> (gdb)
> 
> The code in question there is:
> 
> 5511 /* SIGCHLD handler that serves two purposes: In non-stop/async mode,
> 5512    so we notice when any child changes state, and notify the
> 5513    event-loop; it allows us to use sigsuspend in linux_nat_wait_1
> 5514    above to wait for the arrival of a SIGCHLD.  */
> 5515
> 5516 static void
> 5517 sigchld_handler (int signo)
> 5518 {
> 5519   int old_errno = errno;
> 5520
> 5521   if (debug_linux_nat)
> 5522     ui_file_write_async_safe (gdb_stdlog,
> 5523                               "sigchld\n", sizeof ("sigchld\n") - 1);
> 5524
> 5525   if (signo == SIGCHLD
> 5526       && linux_nat_event_pipe[0] != -1)
> 5527     async_file_mark (); /* Let the event loop know that there are
> 5528                            events to handle.  */
> 5529
> 5530   errno = old_errno;
> 5531 }
> 
> And from the erlang/confd case:
> 
> (wheezydev)root@omap5:~/rmf# gdb /usr/lib/confd/bin/confd mibs/core
> GNU gdb (GDB) 7.4.1-debian
> Copyright (C) 2012 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "arm-linux-gnueabihf".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/lib/confd/bin/confd...done.
> [New LWP 29581]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
> Core was generated by `/usr/lib/confd/bin/confd -B -- -root /usr/lib/confd -progname confd -- -home /r'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000 in ?? ()
> (gdb) where
> #0  0x00000000 in ?? ()
> #1  0x000b95ea in onchld (signum=<optimized out>) at sys/unix/sys.c:1150
> #2  <signal handler called>
> #3  0x000efd4c in __udivsi3 ()
> #4  0x000efdf0 in __aeabi_uidivmod ()
> #5  0x000efdf0 in __aeabi_uidivmod ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> The code there is:
> 1138 #if (defined(SIG_SIGSET) || defined(SIG_SIGNAL))
> 1139 static RETSIGTYPE onchld(void)
> 1140 #else
> 1141 static RETSIGTYPE onchld(int signum)
> 1142 #endif
> 1143 {
> 1144 #if CHLDWTHR
> 1145     ASSERT(0); /* We should *never* catch a SIGCHLD signal */
> 1146 #elif defined(ERTS_SMP)
> 1147     smp_sig_notify('C');
> 1148 #else
> 1149     children_died = 1;
> 1150     ERTS_CHK_IO_INTR(1); /* Make sure we don't sleep in poll */
> 1151 #endif
> 1152 }
> 
> CHLDWTHR and ERTS_SMP are NOT defined/true in this case, so the last
> lines are the ones in use.
> 
> I have tried to use strace on the process, and every call is exactly
> the same between a working run and a segfaulting run, up until SIGCHLD
> is received, and then on some runs it segfaults without making any
> syscalls first.
> 
> I have tried slowing the CPU from 1.5GHz to 1.0GHz, which made no difference.
> 
> I tried running with maxcpus=1, which also made no difference.
> 
> I tried upgrading libc from 2.13 (Debian Wheezy) to 2.18 (Debian Jessie),
> along with libgcc, and it made no difference either.
> 
> gdb seems to segfault maybe 1 in 1000 times.  confd segfaults more like
> 1 in 20 times so it is particularly good at triggering whatever the
> problem is.
> 
> As I said, I have run the same code on a cubox (Marvell Armada 510 which
> is a single core PJ4 at 800MHz as far as I recall), and tail-f has run
> it on a pandaboard (dual Cortex-A9 as far as I know), and none of those
> have ever had any of these segfaults.  I have seen it on every single
> board we have of the dra7xx-evm boards.
> 
> Does anyone have a hint as to where to go next on tracking this down?

Hmm, so I tried Linus's latest tree
(4907cdca7210c5895311bddcf05a4c85b67d8566), and while it is missing a
number of the dra7xx drivers (like ethernet, which is a bit inconvenient),
it does boot.

So far, the test cases that were failing with the 3.8.13-ti kernel are
not failing.  The test that failed most often has now been running for
2 hours straight without a segfault.  It usually lasted about
30 seconds with 3.8.13-ti.
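
For anyone who wants to try something similar on other hardware without
confd, a minimal stress test along the following lines might be a starting
point.  It is only a sketch and is not the test referred to above: the
names on_sigchld and stress_sigchld are made up for illustration, and
whether it hits the same failure at all is an assumption.  It simply
imitates the pattern in the traces, a trivial SIGCHLD handler that
interrupts a loop doing unsigned divisions.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set from the handler, like "children_died = 1" in the erlang code. */
static volatile sig_atomic_t children_died;

/* Trivial handler (hypothetical name): set a flag, preserve errno. */
static void on_sigchld(int signo)
{
    int saved_errno = errno;

    (void)signo;
    children_died = 1;
    errno = saved_errno;
}

/*
 * Hypothetical helper: fork `iterations` short-lived children, one at a
 * time, and keep the parent busy with unsigned divisions until the
 * SIGCHLD handler has run.  Returns the number of children that exited
 * normally, or -1 if setup or fork() failed.
 */
int stress_sigchld(int iterations)
{
    struct sigaction sa;
    sigset_t set;
    volatile unsigned sink = 0;
    int reaped = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) != 0)
        return -1;

    /* Make sure SIGCHLD is not blocked, or the loop below never ends. */
    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    if (sigprocmask(SIG_UNBLOCK, &set, NULL) != 0)
        return -1;

    for (int i = 0; i < iterations; i++) {
        int status;
        pid_t pid;

        children_died = 0;      /* clear before fork to avoid a race */
        pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);           /* child exits at once, raising SIGCHLD */

        /* Divide until the handler has interrupted us; (n | 1u) keeps
           the divisor non-zero even if n wraps around. */
        for (unsigned n = 1; !children_died; n++)
            sink += 1000000007u / (n | 1u) + 1000000007u % (n | 1u);

        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            reaped++;
    }
    return reaped;
}
```

Calling stress_sigchld() with a large count from a small main(), and
running that from a shell loop, would give a rough failures-per-run
figure to compare between kernels.  A return value equal to the requested
count only means that every child was reaped normally.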

So now I am trying to determine if the problem is something solved
between 3.8 and 3.14, or if one of the not yet merged drivers from TI
is causing the problem.  At least I have hope that this problem is
going to go away.  Too bad I also need xenomai soon, which doesn't seem
to be quite at this level yet (not that I can't try to port the ipipe
patch myself, having done it before on powerpc).

Any guesses as to what change between 3.8 and 3.14 could have anything
to do with these segfaults?

-- 
Len Sorensen