Random, rare, but reproducible segmentation faults

Thu Jul 9 06:49:44 EDT 2020

Hi Aurélien,

On 7/8/20 at 4:28 PM, Aurelien Jarno wrote:
> Hi all,
> 
> For some time I have seen random but relatively rare segmentation faults
> when building Debian packages, either in a QEMU virtual machine or on a
> HiFive Unleashed board. I have recently been able to reproduce the issue
> by building Qt [1]. It usually fails after roughly one hour of building,
> not always on the same C++ file, but always on the same set of files.
> When make is run again, the build usually succeeds. When the failure
> happens, one gets the following error from GCC:
> 
> | g++: internal compiler error: Segmentation fault signal terminated program cc1plus
> | Please submit a full bug report,
> | with preprocessed source if appropriate.
> 
> And the following kernel error:
> 
> | [267054.967857] cc1plus[1171618]: unhandled signal 11 code 0x1 at 0x000000156888a430 in cc1plus[10000+e0e000]
> | [267054.976759] CPU: 3 PID: 1171618 Comm: cc1plus Not tainted 5.7.7+ #1
> | [267054.983101] epc: 0000000000a70e3e ra : 0000000000a71dbe sp : 0000003ffff3c870
> | [267054.990293]  gp : 0000000000e33158 tp : 000000156a0f0720 t0 : 0000001569feb0d0
> | [267054.997593]  t1 : 0000000000182a2c t2 : 0000000000e30620 s0 : 000000000003b7c0
> | [267055.004880]  s1 : 000000000003b7c0 a0 : 000000156888a420 a1 : 000000000003b7c0
> | [267055.012176]  a2 : 0000000000000002 a3 : 000000000003b7c0 a4 : 000000000003b7c0
> | [267055.019473]  a5 : 0000000000000001 a6 : 61746e656d676553 a7 : 73737350581f0402
> | [267055.026763]  s2 : 0000003ffff3c8c8 s3 : 000000156888a420 s4 : 000000007fffffff
> | [267055.034052]  s5 : 0000000070000000 s6 : 0000000000000000 s7 : 0000003ffff3d638
> | [267055.041345]  s8 : 0000000000e9ab18 s9 : 0000000000000000 s10: 0000000000e9a9d0
> | [267055.048636]  s11: 0000000000000000 t3 : 000000156888a420 t4 : 0000000000000001
> | [267055.055930]  t5 : 0000000000000001 t6 : 0000000000000000
> | [267055.061311] status: 8000000200006020 badaddr: 000000156888a430 cause: 000000000000000d
> 
> I have been able to bisect the issue and found it has been introduced by
> this commit:
> 
> | commit 54c95a11cc1b5e1d578209e027adf5775395dafd
> | Author: Alexandre Ghiti <alex at ghiti.fr>
> | Date:   Mon Sep 23 15:39:21 2019 -0700
> |
> |     riscv: make mmap allocation top-down by default
> |
> |     In order to avoid wasting user address space by using bottom-up mmap
> |     allocation scheme, prefer top-down scheme when possible.
> 
> Reverting this commit, even on 5.7.7, fixes the issue. I debugged things
> a bit more, and found that the problem doesn't come from the top-down
> allocation (i.e. setting vm.legacy_va_layout to 1 doesn't change
> anything). However, I have found that it comes from the randomization;
> the following patch is enough to fix (or work around?) the issue:
> 
> | --- a/mm/util.c
> | +++ b/mm/util.c
> | @@ -397,8 +397,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
> |  {
> |         unsigned long random_factor = 0UL;
> |
> | -       if (current->flags & PF_RANDOMIZE)
> | -               random_factor = arch_mmap_rnd();
> | +/*     if (current->flags & PF_RANDOMIZE)
> | +               random_factor = arch_mmap_rnd();*/
> |
> |         if (mmap_is_legacy(rlim_stack)) {
> |                 mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
> 
> I have also tried to play with vm.mmap_rnd_bits, but it seems that it
> doesn't have any real effect. 

That's too bad, because this is the only riscv-specific thing in this
feature. The feature itself is used by arm, arm64 and mips, so we should be
looking at something riscv-specific here, but right now I don't know what.

I will take a look at that.
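
For reference, here is a minimal userspace sketch of the kind of offset the random_factor above represents. It assumes the offset is built by keeping the low vm.mmap_rnd_bits bits of a random value and scaling them to whole pages, and it assumes 4 KiB pages. The names sketch_mmap_rnd, sketch_mmap_rnd_max and SKETCH_PAGE_SHIFT are made up for illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Userspace model of the offset computation: keep the low `rnd_bits`
 * bits of a random value and scale the result to whole pages.
 * `random_value` stands in for the kernel's random source.
 */
static uint64_t sketch_mmap_rnd(uint64_t random_value, unsigned int rnd_bits)
{
    /* Keep the shift below well defined. */
    assert(rnd_bits > 0 && rnd_bits < 64 - SKETCH_PAGE_SHIFT);

    uint64_t mask = (UINT64_C(1) << rnd_bits) - 1;
    return (random_value & mask) << SKETCH_PAGE_SHIFT;
}

/* Largest offset that can be produced for a given number of bits. */
static uint64_t sketch_mmap_rnd_max(unsigned int rnd_bits)
{
    return sketch_mmap_rnd(UINT64_MAX, rnd_bits);
}
```

Under these assumptions, sketch_mmap_rnd_max(24) stays just below 64 GiB, which gives an idea of how far the mmap base can move when the sysctl is set to its maximum.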

Thanks,

Alex

> I however noticed that setting this value
> to 24 (the maximum) might move the heap next to the stack in some cases,
> although it seems unrelated to the original issue:
> 
> | $ cat /proc/self/maps
> | 340fde4000-340fe06000 rw-p 00000000 00:00 0
> | 340fe06000-340ff08000 r-xp 00000000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff08000-340ff09000 ---p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff09000-340ff0c000 r--p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff0c000-340ff0f000 rw-p 00105000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff0f000-340ff14000 rw-p 00000000 00:00 0
> | 340ff1d000-340ff1f000 r-xp 00000000 00:00 0                              [vdso]
> | 340ff1f000-340ff36000 r-xp 00000000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
> | 340ff36000-340ff37000 r--p 00016000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
> | 340ff37000-340ff38000 rw-p 00017000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
> | 340ff38000-340ff39000 rw-p 00000000 00:00 0
> | 3924505000-392450b000 r-xp 00000000 fe:01 1048640                        /bin/cat
> | 392450b000-392450c000 r--p 00005000 fe:01 1048640                        /bin/cat
> | 392450c000-392450d000 rw-p 00006000 fe:01 1048640                        /bin/cat
> | 3955e23000-3955e44000 rw-p 00000000 00:00 0                              [heap]
> | 3fffa2b000-3fffa4c000 rw-p 00000000 00:00 0                              [stack]
> 
> To come back to the original issue, I don't know how to debug it
> further. I have already tried to get a core dump and analyze it with
> GDB, but the code triggering the failure is not part of the binary. Any
> hint or help would be welcome.
> 
> Thanks,
> Aurelien
> 
> 
> [1] Technically the qtbase-opensource-src Debian package:
>      https://packages.debian.org/source/sid/qtbase-opensource-src
>
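
The heap-to-stack distance in the /proc/self/maps listing quoted above can be checked with a small helper. This is only a sketch: it assumes each line starts with a hexadecimal "start-end" range, as in that listing, and the names sketch_parse_range and sketch_gap are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse the "start-end" address range at the beginning of one
 * /proc/<pid>/maps line. Returns 1 on success, 0 on a malformed line.
 */
static int sketch_parse_range(const char *line, uint64_t *start, uint64_t *end)
{
    unsigned long long s, e;

    if (sscanf(line, "%llx-%llx", &s, &e) != 2 || e < s)
        return 0;
    *start = s;
    *end = e;
    return 1;
}

/* Gap in bytes between the end of `lower` and the start of `upper`. */
static uint64_t sketch_gap(const char *lower, const char *upper)
{
    uint64_t ls = 0, le = 0, us = 0, ue = 0;
    int ok = sketch_parse_range(lower, &ls, &le) &&
             sketch_parse_range(upper, &us, &ue);

    /* Both lines must parse and must not overlap. */
    assert(ok && us >= le);
    (void)ok;
    (void)ls;
    (void)ue;
    return us - le;
}
```

Passing the [heap] and [stack] lines from the listing to sketch_gap gives 0x6a9be7000 bytes. Comparing that value across runs with different vm.mmap_rnd_bits settings would show how often the two regions end up close together.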