Random, rare, but reproducible segmentation faults

Wed Jul 8 16:28:44 EDT 2020

Hi all,

For some time I have seen random but relatively rare segmentation faults
when building Debian packages, either in a QEMU virtual machine or on a
HiFive Unleashed board. I have recently been able to reproduce the issue
by building Qt [1]. The build usually fails after roughly one hour, not
always on the same C++ file, but always within the same set of files.
Running make again usually succeeds. When the failure happens, GCC
reports the following error:

| g++: internal compiler error: Segmentation fault signal terminated program cc1plus
| Please submit a full bug report,
| with preprocessed source if appropriate.

The kernel logs the following error:

| [267054.967857] cc1plus[1171618]: unhandled signal 11 code 0x1 at 0x000000156888a430 in cc1plus[10000+e0e000]
| [267054.976759] CPU: 3 PID: 1171618 Comm: cc1plus Not tainted 5.7.7+ #1
| [267054.983101] epc: 0000000000a70e3e ra : 0000000000a71dbe sp : 0000003ffff3c870
| [267054.990293]  gp : 0000000000e33158 tp : 000000156a0f0720 t0 : 0000001569feb0d0
| [267054.997593]  t1 : 0000000000182a2c t2 : 0000000000e30620 s0 : 000000000003b7c0
| [267055.004880]  s1 : 000000000003b7c0 a0 : 000000156888a420 a1 : 000000000003b7c0
| [267055.012176]  a2 : 0000000000000002 a3 : 000000000003b7c0 a4 : 000000000003b7c0
| [267055.019473]  a5 : 0000000000000001 a6 : 61746e656d676553 a7 : 73737350581f0402
| [267055.026763]  s2 : 0000003ffff3c8c8 s3 : 000000156888a420 s4 : 000000007fffffff
| [267055.034052]  s5 : 0000000070000000 s6 : 0000000000000000 s7 : 0000003ffff3d638
| [267055.041345]  s8 : 0000000000e9ab18 s9 : 0000000000000000 s10: 0000000000e9a9d0
| [267055.048636]  s11: 0000000000000000 t3 : 000000156888a420 t4 : 0000000000000001
| [267055.055930]  t5 : 0000000000000001 t6 : 0000000000000000
| [267055.061311] status: 8000000200006020 badaddr: 000000156888a430 cause: 000000000000000d
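
For readers less familiar with this kind of dump, here is a minimal
sketch of how its fields can be decoded. The numeric values are copied
from the log above; the dictionary and function names are made up for
illustration. The cause codes are the ones defined for scause in the
RISC-V privileged specification, and the si_code values are the Linux
SIGSEGV codes.

```python
# Values copied from the kernel log above.
OOPS = {
    "si_code": 0x1,                 # "code 0x1"
    "badaddr": 0x000000156888A430,  # faulting address
    "epc": 0x0000000000A70E3E,      # program counter at the fault
    "a0": 0x000000156888A420,       # register closest to badaddr
    "cause": 0xD,                   # scause
    "map_start": 0x10000,           # from "cc1plus[10000+e0e000]"
    "map_size": 0xE0E000,
}

# Page-fault exception codes (scause with the interrupt bit clear).
SCAUSE = {
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}

# si_code values for SIGSEGV on Linux.
SEGV_CODES = {
    1: "SEGV_MAPERR (address not mapped)",
    2: "SEGV_ACCERR (invalid permissions for mapping)",
}


def describe(oops):
    """Return a human-readable summary of the fields of interest."""
    map_end = oops["map_start"] + oops["map_size"]
    return {
        "cause": SCAUSE.get(oops["cause"], "other"),
        "si_code": SEGV_CODES.get(oops["si_code"], "other"),
        # Does the program counter fall inside the mapping the kernel named?
        "epc_in_mapping": oops["map_start"] <= oops["epc"] < map_end,
        # Distance between the faulting address and the pointer held in a0.
        "badaddr_minus_a0": oops["badaddr"] - oops["a0"],
    }


if __name__ == "__main__":
    for key, value in describe(OOPS).items():
        print(f"{key}: {value}")
```

This only restates what the log already contains: it does not say why
the address was unmapped at the time of the access.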

I have been able to bisect the issue and found that it was introduced by
this commit:

| commit 54c95a11cc1b5e1d578209e027adf5775395dafd
| Author: Alexandre Ghiti <alex at ghiti.fr>
| Date:   Mon Sep 23 15:39:21 2019 -0700
|
|     riscv: make mmap allocation top-down by default
|
|     In order to avoid wasting user address space by using bottom-up mmap
|     allocation scheme, prefer top-down scheme when possible.

Reverting this commit, even on 5.7.7, fixes the issue. I debugged things
a bit more and found that the problem does not come from the top-down
allocation itself (setting vm.legacy_va_layout to 1 doesn't change
anything). Instead, it seems to come from the randomization: the
following patch is enough to fix (or work around?) the issue:

| --- a/mm/util.c
| +++ b/mm/util.c
| @@ -397,8 +397,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
|  {
|         unsigned long random_factor = 0UL;
|  
| -       if (current->flags & PF_RANDOMIZE)
| -               random_factor = arch_mmap_rnd();
| +/*     if (current->flags & PF_RANDOMIZE)
| +               random_factor = arch_mmap_rnd();*/
|  
|         if (mmap_is_legacy(rlim_stack)) {
|                 mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
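
A related experiment that needs no kernel rebuild is to start the build
with the ADDR_NO_RANDOMIZE personality flag set, which is what
"setarch -R" does. The sketch below is only an approximation of the
patch: as far as I know the flag keeps PF_RANDOMIZE from being set at
exec time, so it also turns off stack and brk randomization, not just
the mmap base offset. Whether it hides the segfault in the same way is
an open question. The function names are hypothetical.

```python
import ctypes
import os

# Flag value from <sys/personality.h> on Linux.
ADDR_NO_RANDOMIZE = 0x0040000
# Passing this value to personality() queries the current persona
# without changing it.
QUERY_PERSONA = 0xFFFFFFFF


def persona_without_aslr(current):
    """Return the persona value with address randomization disabled."""
    return current | ADDR_NO_RANDOMIZE


def exec_without_aslr(argv):
    """Replace the current process with argv, ASLR disabled (Linux only).

    The persona is inherited across fork and exec, so child processes
    such as cc1plus started by the build see it as well.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.personality.argtypes = [ctypes.c_ulong]
    libc.personality.restype = ctypes.c_int

    current = libc.personality(QUERY_PERSONA)
    if current == -1:
        raise OSError(ctypes.get_errno(), "personality(query) failed")
    if libc.personality(persona_without_aslr(current)) == -1:
        raise OSError(ctypes.get_errno(), "personality(set) failed")
    os.execvp(argv[0], argv)
```

For example, calling exec_without_aslr(["make"]) from a small wrapper
script would run the whole build with randomization disabled.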

I have also tried changing vm.mmap_rnd_bits, but it doesn't seem to
have any real effect on the issue. I did notice, however, that setting
it to 24 (the maximum) might move the heap next to the stack in some
cases, although that seems unrelated to the original issue:

| $ cat /proc/self/maps
| 340fde4000-340fe06000 rw-p 00000000 00:00 0 
| 340fe06000-340ff08000 r-xp 00000000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff08000-340ff09000 ---p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff09000-340ff0c000 r--p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff0c000-340ff0f000 rw-p 00105000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff0f000-340ff14000 rw-p 00000000 00:00 0 
| 340ff1d000-340ff1f000 r-xp 00000000 00:00 0                              [vdso]
| 340ff1f000-340ff36000 r-xp 00000000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
| 340ff36000-340ff37000 r--p 00016000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
| 340ff37000-340ff38000 rw-p 00017000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
| 340ff38000-340ff39000 rw-p 00000000 00:00 0 
| 3924505000-392450b000 r-xp 00000000 fe:01 1048640                        /bin/cat
| 392450b000-392450c000 r--p 00005000 fe:01 1048640                        /bin/cat
| 392450c000-392450d000 rw-p 00006000 fe:01 1048640                        /bin/cat
| 3955e23000-3955e44000 rw-p 00000000 00:00 0                              [heap]
| 3fffa2b000-3fffa4c000 rw-p 00000000 00:00 0                              [stack]

Coming back to the original issue, I don't know how to debug it
further. I have already tried to get a core dump and analyze it with
GDB, but the code triggering the failure is not part of the binary. Any
hint or help would be welcome.

Thanks,
Aurelien

[1] Technically the qtbase-opensource-src Debian package:
    https://packages.debian.org/source/sid/qtbase-opensource-src

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien at aurel32.net                 http://www.aurel32.net