Random, rare, but reproducible segmentation faults

Alex Ghiti alex at ghiti.fr
Fri Jul 10 01:15:50 EDT 2020


Hi Aurélien,

On 7/9/20 at 6:49 AM, Alex Ghiti wrote:
> Hi Aurélien,
> 
> On 7/8/20 at 4:28 PM, Aurelien Jarno wrote:
>> Hi all,
>>
>> For some time I have seen random but relatively rare segmentation faults
>> when building Debian packages, either in a QEMU virtual machine or on a
>> HiFive Unleashed board. I have recently been able to reproduce the issue
>> by building Qt [1]. It usually fails after roughly one hour of build,
>> not always on the same C++ file, but always on the same set of files.
>> Running make again, the build usually succeeds. When the failure
>> happens, one gets the following error from GCC:
>>
>> | g++: internal compiler error: Segmentation fault signal terminated program cc1plus
>> | Please submit a full bug report,
>> | with preprocessed source if appropriate.
>>
>> And the following kernel error:
>>
>> | [267054.967857] cc1plus[1171618]: unhandled signal 11 code 0x1 at 0x000000156888a430 in cc1plus[10000+e0e000]
>> | [267054.976759] CPU: 3 PID: 1171618 Comm: cc1plus Not tainted 5.7.7+ #1
>> | [267054.983101] epc: 0000000000a70e3e ra : 0000000000a71dbe sp : 0000003ffff3c870
>> | [267054.990293]  gp : 0000000000e33158 tp : 000000156a0f0720 t0 : 0000001569feb0d0
>> | [267054.997593]  t1 : 0000000000182a2c t2 : 0000000000e30620 s0 : 000000000003b7c0
>> | [267055.004880]  s1 : 000000000003b7c0 a0 : 000000156888a420 a1 : 000000000003b7c0
>> | [267055.012176]  a2 : 0000000000000002 a3 : 000000000003b7c0 a4 : 000000000003b7c0
>> | [267055.019473]  a5 : 0000000000000001 a6 : 61746e656d676553 a7 : 73737350581f0402
>> | [267055.026763]  s2 : 0000003ffff3c8c8 s3 : 000000156888a420 s4 : 000000007fffffff
>> | [267055.034052]  s5 : 0000000070000000 s6 : 0000000000000000 s7 : 0000003ffff3d638
>> | [267055.041345]  s8 : 0000000000e9ab18 s9 : 0000000000000000 s10: 0000000000e9a9d0
>> | [267055.048636]  s11: 0000000000000000 t3 : 000000156888a420 t4 : 0000000000000001
>> | [267055.055930]  t5 : 0000000000000001 t6 : 0000000000000000
>> | [267055.061311] status: 8000000200006020 badaddr: 000000156888a430 cause: 000000000000000d
>>
>> I have been able to bisect the issue and found it has been introduced by
>> this commit:
>>
>> | commit 54c95a11cc1b5e1d578209e027adf5775395dafd
>> | Author: Alexandre Ghiti <alex at ghiti.fr>
>> | Date:   Mon Sep 23 15:39:21 2019 -0700
>> |
>> |     riscv: make mmap allocation top-down by default
>> |
>> |     In order to avoid wasting user address space by using bottom-up
>> |     mmap allocation scheme, prefer top-down scheme when possible.
>>
>> Reverting this commit, even on 5.7.7, fixes the issue. I debugged things
>> a bit more, and found that the problem doesn't come from the top-down
>> allocation itself (i.e. setting vm.legacy_va_layout to 1 doesn't change
>> anything). It comes from the randomization: the following patch is
>> enough to fix (or work around?) the issue:
>>
>> | --- a/mm/util.c
>> | +++ b/mm/util.c
>> | @@ -397,8 +397,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
>> |  {
>> |         unsigned long random_factor = 0UL;
>> |
>> | -       if (current->flags & PF_RANDOMIZE)
>> | -               random_factor = arch_mmap_rnd();
>> | +/*     if (current->flags & PF_RANDOMIZE)
>> | +               random_factor = arch_mmap_rnd();*/
>> |
>> |         if (mmap_is_legacy(rlim_stack)) {
>> |                 mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
>>
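For reference, riscv doesn't provide its own arch_mmap_rnd(): with the 
top-down layout it uses the generic helper from mm/util.c, which, if I 
remember correctly, boils down to something like the simplified sketch 
below (mmap_rnd_bits is what the vm.mmap_rnd_bits sysctl controls; on 
riscv64 I believe it defaults to 18 and is capped at 24):

/* Simplified sketch of the generic arch_mmap_rnd() from mm/util.c,
 * written from memory; it may differ slightly from the exact 5.7
 * source (the compat case is omitted). */
unsigned long arch_mmap_rnd(void)
{
	unsigned long rnd;

	/* Take mmap_rnd_bits bits of randomness... */
	rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);

	/* ...and turn them into a page-aligned offset. */
	return rnd << PAGE_SHIFT;
}
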
>> I have also tried to play with vm.mmap_rnd_bits, but it seems that it
>> doesn't have any real effect. 
> 
> That's too bad, because this is the only riscv-specific thing in this 
> feature. The same code is used by arm, arm64 and mips, so we should be 
> looking at something riscv-specific here, but right now I don't know what.
> 
> I will take a look at that.
> 
> Thanks,
> 
> Alex
> 
>> I have however noticed that setting this value
>> to 24 (the maximum) might move the heap next to the stack in some cases,
>> although it seems unrelated to the original issue:
>>
>> | $ cat /proc/self/maps
>> | 340fde4000-340fe06000 rw-p 00000000 00:00 0
>> | 340fe06000-340ff08000 r-xp 00000000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff08000-340ff09000 ---p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff09000-340ff0c000 r--p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff0c000-340ff0f000 rw-p 00105000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff0f000-340ff14000 rw-p 00000000 00:00 0
>> | 340ff1d000-340ff1f000 r-xp 00000000 00:00 0                              [vdso]
>> | 340ff1f000-340ff36000 r-xp 00000000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
>> | 340ff36000-340ff37000 r--p 00016000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
>> | 340ff37000-340ff38000 rw-p 00017000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
>> | 340ff38000-340ff39000 rw-p 00000000 00:00 0
>> | 3924505000-392450b000 r-xp 00000000 fe:01 1048640                        /bin/cat
>> | 392450b000-392450c000 r--p 00005000 fe:01 1048640                        /bin/cat
>> | 392450c000-392450d000 rw-p 00006000 fe:01 1048640                        /bin/cat
>> | 3955e23000-3955e44000 rw-p 00000000 00:00 0                              [heap]
>> | 3fffa2b000-3fffa4c000 rw-p 00000000 00:00 0                              [stack]
>>
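That heap placement actually looks plausible to me: with 24 bits of 
randomization the window below the stack gets huge. A quick 
back-of-the-envelope check (my own numbers, assuming sv39 with 4K pages 
and ignoring the stack guard gap):

#include <stdio.h>

int main(void)
{
	unsigned long task_size = 1UL << 38;  /* 256 GiB of user VA on sv39 */
	unsigned long rnd_bits = 24;          /* vm.mmap_rnd_bits = 24 */
	unsigned long max_rnd = ((1UL << rnd_bits) - 1) << 12;

	/* Prints 63, i.e. just under 64 GiB of possible offset. */
	printf("max random offset: %lu GiB\n", max_rnd >> 30);
	/* Prints 0x3000001000: mmap_base can land anywhere above this. */
	printf("lowest mmap_base:  0x%lx\n", task_size - max_rnd);
	return 0;
}

So with 24 bits the whole mmap region, PIE binary and heap included, 
can indeed end up quite close to the stack.
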
>> To come back to the original issue, I don't know how to debug it
>> further. I have already tried to get a core dump and analyze it with
>> GDB, but the code triggering the failure is not part of the binary. Any
>> hint or help would be welcome.
>>
>> Thanks,
>> Aurelien
>>
>>
>> [1] Technically the qtbase-opensource-src Debian package:
>>      https://packages.debian.org/source/sid/qtbase-opensource-src
>>

I have a Debian kernel downloaded from 
https://people.debian.org/~gio/dqib/ that runs using the following qemu 
command:

qemu-system-riscv64 -machine virt -cpu rv64 -m 1G \
    -device virtio-blk-device,drive=hd \
    -drive file=image.qcow2,if=none,id=hd \
    -device virtio-net-device,netdev=net \
    -netdev user,id=net,hostfwd=tcp::2222-:22 \
    -bios ~/wip/lpc/buildroot/build_rv64/images/fw_jump.elf \
    -kernel kernel -initrd initrd \
    -object rng-random,filename=/dev/urandom,id=rng \
    -device virtio-rng-device,rng=rng \
    -nographic -append "root=/dev/vda1 console=ttyS0"

First, is this kernel version OK for reproducing the bug, or should I 
download another image? I'd like to avoid having to rebuild the kernel 
myself if possible.

Now I would like to reproduce the bug: can you give me instructions on 
how to build the Qt package?

Is the page fault address always in the same area? It might be 
interesting to look for a pattern in those addresses: maybe you could 
print the random offset to try to correlate the two, and also print the 
entire virtual memory mapping at the time of the fault to check what 
the faulting address is close to; I sketch one possible (untested) way 
to do that below. Cause 0xd is a load page fault, and since it was 
fatal, the virtual address does not seem to be mapped at all, which is 
weird. My guess is that the randomization "reveals" the bug, but that 
the bug is still there once the randomization is disabled.
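I don't know offhand what the cleanest way to do that is, but an 
untested hack along the following lines might work (written from 
memory against a 5.7-era tree, so take it as a sketch; dump_vmas() is 
a name I made up, not an existing kernel function):

/* Untested debug hack: dump all the VMAs of the faulting task just
 * before SIGSEGV is delivered, e.g. from the bad_area path in
 * arch/riscv/mm/fault.c. */
static void dump_vmas(struct mm_struct *mm, unsigned long badaddr)
{
	struct vm_area_struct *vma;

	/* mm->mmap_base includes the random offset, so printing it should
	 * let you correlate the faulting address with the randomization. */
	pr_info("fault at %016lx, mmap_base %016lx\n", badaddr, mm->mmap_base);

	/* In 5.7 the VMAs are still chained on a sorted linked list. */
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		pr_info("  vma %016lx-%016lx flags %08lx\n",
			vma->vm_start, vma->vm_end, vma->vm_flags);
}

It is probably best called before the fault handler drops mmap_sem, so 
that the list is stable while you walk it.
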

I'm sorry I don't have much more to propose here :(

Alex



