Random, rare, but reproducible segmentation faults

Fri Jul 10 16:37:41 EDT 2020

Hi Aurélien,

On 7/10/20 at 3:12 PM, Aurelien Jarno wrote:
> Hi Alex,
> 
> On 2020-07-10 01:15, Alex Ghiti wrote:
>> I have a debian kernel downloaded from here
>> https://people.debian.org/~gio/dqib/ that runs using the following qemu
>> command:
>>
>> qemu-system-riscv64 -machine virt -cpu rv64 -m 1G -device
>> virtio-blk-device,drive=hd -drive file=image.qcow2,if=none,id=hd -device
>> virtio-net-device,netdev=net -netdev user,id=net,hostfwd=tcp::2222-:22 -bios
>> ~/wip/lpc/buildroot/build_rv64/images/fw_jump.elf -kernel kernel -initrd
>> initrd -object rng-random,filename=/dev/urandom,id=rng -device
>> virtio-rng-device,rng=rng -nographic -append "root=/dev/vda1 console=ttyS0"
>>
>> First, is this kernel version OK to reproduce the bug? Or should I download
>> another image? I'd like to avoid having to rebuild the kernel myself if
>> possible.
> 
> Yes, that should do it: it's running kernel 5.7.6, which is enough to
> reproduce the issue. You just need to increase the memory a bit more (4 to
> 8GB) and add more CPUs, for example with -smp 4.

Ok thanks.

> 
>> Now I would like to reproduce the bug: can you give me instructions on how
>> to compile the qt package?
> 
> The following sequence should allow you to build it:
> - sudo apt-get update
> - sudo apt-get install build-essential
> - sudo apt-get build-dep qtbase-opensource-src
> - apt-get source qtbase-opensource-src
> - cd qtbase-opensource-src-5.14.2+dfsg/
> - dpkg-buildpackage -B
> 
> Alternatively I can prepare you an image with everything ready.

I hope my laptop will survive that :)

> 
>> Is the page fault address always in the same area? It might be interesting
>> to find some pattern in those addresses; maybe you could also print the
>> random offset to try to link both?
> 
> It seems really random to me, with 3 outliers:
> 0x0000003fe7ef3140
> 0x0000003fcd16cff0
> 0x0000003fb9e96170
> 0x0000003fd3f4a120
> 0x448173f67cdbc8b0
> 0x0000003fdfe093f0
> 0x0000003fdfe093f0
> 0x0000003fe1d4aa70
> 0x0000003fc2cfef90
> 0x0000003fc0f5d050
> 0x0000003fe1d879d0
> 0x0000003fe9d3e990
> 0xf0ef4585be2ae01f
> 0x00000034484f71b0
> 0x0000003fde30e960
> 0x000000156888a430
> 0x0000003eb8560936
> 0x0000003fb121a490
> 0x0000003fb9abddd0
> 0x0000003fe41fc5d0

Indeed, that looks random. The outliers are weird: at first sight I
would say the userspace program is responsible for them, but that
deserves more thought.
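
To make the "outliers" a bit more concrete, here is a minimal sketch that buckets the fault addresses listed above. It assumes Sv39 paging, where user space is the lower 2**38 bytes of the address space; the 4 GiB "near the top" window and the names SV39_USER_END, NEAR_TOP_WINDOW, bucket and summarize are arbitrary choices made for this illustration, not anything taken from the kernel.

```python
from collections import Counter

# Fault addresses copied from the list quoted above.
FAULT_ADDRESSES = [
    0x0000003fe7ef3140, 0x0000003fcd16cff0, 0x0000003fb9e96170,
    0x0000003fd3f4a120, 0x448173f67cdbc8b0, 0x0000003fdfe093f0,
    0x0000003fdfe093f0, 0x0000003fe1d4aa70, 0x0000003fc2cfef90,
    0x0000003fc0f5d050, 0x0000003fe1d879d0, 0x0000003fe9d3e990,
    0xf0ef4585be2ae01f, 0x00000034484f71b0, 0x0000003fde30e960,
    0x000000156888a430, 0x0000003eb8560936, 0x0000003fb121a490,
    0x0000003fb9abddd0, 0x0000003fe41fc5d0,
]

# Assumption: Sv39, so user space ends at 2**38 (0x4000000000).
SV39_USER_END = 1 << 38
# Arbitrary window below the top of user space, where mmap'ed libraries
# and the stack usually live.
NEAR_TOP_WINDOW = 1 << 32


def bucket(addr):
    """Classify one fault address relative to the assumed user range."""
    if addr >= SV39_USER_END:
        return "outside user range"
    if addr >= SV39_USER_END - NEAR_TOP_WINDOW:
        return "user, top 4 GiB"
    return "user, lower"


def summarize(addresses):
    """Count how many addresses fall into each bucket."""
    return Counter(bucket(a) for a in addresses)


if __name__ == "__main__":
    for name, count in summarize(FAULT_ADDRESSES).items():
        print(f"{count:2d}  {name}")
```

Addresses in the "outside user range" bucket cannot be valid user pointers under that assumption, which would be consistent with a corrupted pointer rather than a missing mapping.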

> 
>> Could you also print the entire virtual memory
>> mapping at the time of the fault (I don't know how to do that) to check what
>> the address is close to?
> 
> Yes, I'll try to find a way to do that.
> 

In case you want to try:

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index ae7b7fe24658..32efe9e750d6 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/perf_event.h>
 #include <linux/signal.h>
 #include <linux/uaccess.h>
+#include <linux/dcache.h>
 
 #include <asm/pgalloc.h>
 #include <asm/ptrace.h>
@@ -166,6 +167,18 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
        mmap_read_unlock(mm);
        /* User mode accesses just cause a SIGSEGV */
        if (user_mode(regs)) {
+        char filename[512], *f;
+
+        for (vma = mm->mmap; vma; vma = vma->vm_next) {
+            f = filename;
+            if (vma->vm_file)
+                f = dentry_path_raw(vma->vm_file->f_path.dentry, filename, 512);
+            else
+                strcpy(filename, "none");
+            pr_err("%px -> %px %s\n",
+                    (void *)vma->vm_start, (void *)vma->vm_end, f);
+        }
+
                do_trap(regs, SIGSEGV, code, addr);
                return;
        }

which will produce output like this:
  # segfault
[   44.297718] 0000000000010000 -> 0000000000011000 /usr/bin/segfault
[   44.298067] 0000000000011000 -> 0000000000012000 /usr/bin/segfault
[   44.298346] 0000000000012000 -> 0000000000013000 /usr/bin/segfault
[   44.298623] 0000003fc89e8000 -> 0000003fc8b26000 /lib/libc-2.29.so
[   44.298897] 0000003fc8b26000 -> 0000003fc8b27000 /lib/libc-2.29.so
[   44.299171] 0000003fc8b27000 -> 0000003fc8b2b000 /lib/libc-2.29.so
[   44.299444] 0000003fc8b2b000 -> 0000003fc8b2d000 /lib/libc-2.29.so
[   44.299770] 0000003fc8b2d000 -> 0000003fc8b33000 none
[   44.300000] 0000003fc8b33000 -> 0000003fc8b34000 none
[   44.300225] 0000003fc8b34000 -> 0000003fc8b35000 none
[   44.300454] 0000003fc8b35000 -> 0000003fc8b52000 /lib/ld-linux-riscv64-lp64.so.1
[   44.300974] 0000003fc8b52000 -> 0000003fc8b53000 /lib/ld-linux-riscv64-lp64.so.1
[   44.301323] 0000003fc8b53000 -> 0000003fc8b54000 /lib/ld-linux-riscv64-lp64.so.1
[   44.301708] 0000003fc8b54000 -> 0000003fc8b55000 none
[   44.301986] 0000003fffbe1000 -> 0000003fffc02000 none
[   44.302684] segfault[123]: unhandled signal 11 code 0x1 at 0x00000000007b3238 in segfault[10000+1000]
[   44.304009] CPU: 2 PID: 123 Comm: segfault Tainted: G      D   5.8.0-rc4 #23
[   44.304448] epc: 0000000000010450 ra : 0000003fc8a08250 sp : 0000003fffc01c30
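
Once the dump is in the log, finding what a faulting address is close to can be scripted. The following is a rough sketch, assuming the "start -> end name" line format printed by the patch above; parse_vmas, locate and SAMPLE_LOG are hypothetical names for this illustration, and the sample is a shortened copy of the output shown above.

```python
import re

# Matches "<16 hex digits> -> <16 hex digits> <name>" as printed by the
# pr_err() in the patch. \s+ also tolerates a name wrapped onto the next line.
VMA_RE = re.compile(r"([0-9a-f]{16}) -> ([0-9a-f]{16})\s+(\S+)")

# Shortened copy of the dmesg output above (illustration only).
SAMPLE_LOG = """\
[   44.297718] 0000000000010000 -> 0000000000011000 /usr/bin/segfault
[   44.298067] 0000000000011000 -> 0000000000012000 /usr/bin/segfault
[   44.298346] 0000000000012000 -> 0000000000013000 /usr/bin/segfault
[   44.298623] 0000003fc89e8000 -> 0000003fc8b26000 /lib/libc-2.29.so
[   44.301986] 0000003fffbe1000 -> 0000003fffc02000 none
"""


def parse_vmas(log_text):
    """Return a list of (start, end, name) tuples found in the log text."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), m.group(3))
        for m in VMA_RE.finditer(log_text)
    ]


def locate(addr, vmas):
    """Return (distance, vma) for the mapping nearest to addr.

    distance is 0 when addr lies inside the mapping; otherwise it is the
    number of bytes between addr and the nearest mapped byte.
    """
    def distance(vma):
        start, end, _ = vma
        if start <= addr < end:
            return 0
        # vm_end is exclusive, so the last mapped byte is end - 1.
        return start - addr if addr < start else addr - (end - 1)

    best = min(vmas, key=distance)
    return distance(best), best


if __name__ == "__main__":
    dist, (start, end, name) = locate(0x7b3238, parse_vmas(SAMPLE_LOG))
    print(f"nearest: {start:#x}-{end:#x} {name} (distance {dist:#x})")
```

Feeding it the fault address from the "unhandled signal" line and the preceding dump should show whether the address is just past the end of a mapping or nowhere near one.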

>> The 0xd cause implies that the virtual address
>> does not exist at all, which is weird. My guess is that the randomization
>> "reveals" the bug, but that the bug is still there once the randomization is
>> disabled.
> 
> I also have that feeling. It could even be a userland issue, with the
> userland not able to cope with some memory mappings.
> 
> Aurelien
> 

Alex