Random, rare, but reproducible segmentation faults
Alex Ghiti
alex at ghiti.fr
Fri Jul 10 16:37:41 EDT 2020
Hi Aurélien,
On 7/10/20 at 3:12 PM, Aurelien Jarno wrote:
> Hi Alex,
>
> On 2020-07-10 01:15, Alex Ghiti wrote:
>> I have a debian kernel downloaded from here
>> https://people.debian.org/~gio/dqib/ that runs using the following qemu
>> command:
>>
>> qemu-system-riscv64 -machine virt -cpu rv64 -m 1G -device
>> virtio-blk-device,drive=hd -drive file=image.qcow2,if=none,id=hd -device
>> virtio-net-device,netdev=net -netdev user,id=net,hostfwd=tcp::2222-:22 -bios
>> ~/wip/lpc/buildroot/build_rv64/images/fw_jump.elf -kernel kernel -initrd
>> initrd -object rng-random,filename=/dev/urandom,id=rng -device
>> virtio-rng-device,rng=rng -nographic -append "root=/dev/vda1 console=ttyS0"
>>
>> First, is this kernel version OK to reproduce the bug? Or should I download
>> another image? I'd like to avoid having to rebuild the kernel myself if
>> possible.
>
> Yes, that should do it: it's running kernel 5.7.6, which is enough to
> reproduce the issue. You just need to increase the memory (to 4 to 8 GB)
> and add more CPUs, for example with -smp 4.
Ok thanks.
>
>> Now I would like to reproduce the bug: can you give me instructions on how
>> to compile the qt package ?
>
> The following sequence should allow you to build it:
> - sudo apt-get update
> - sudo apt-get install build-essential
> - sudo apt-get build-dep qtbase-opensource-src
> - apt-get source qtbase-opensource-src
> - cd qtbase-opensource-src-5.14.2+dfsg/
> - dpkg-buildpackage -B
>
> Alternatively I can prepare you an image with everything ready.
I hope my laptop will survive that :)
>
>> Is the page fault address always in the same area? It might be interesting
>> to find some pattern in those addresses. Maybe you could also print the
>> random offset to try to link the two?
>
> It seems really random to me, with 3 outliers:
> 0x0000003fe7ef3140
> 0x0000003fcd16cff0
> 0x0000003fb9e96170
> 0x0000003fd3f4a120
> 0x448173f67cdbc8b0
> 0x0000003fdfe093f0
> 0x0000003fdfe093f0
> 0x0000003fe1d4aa70
> 0x0000003fc2cfef90
> 0x0000003fc0f5d050
> 0x0000003fe1d879d0
> 0x0000003fe9d3e990
> 0xf0ef4585be2ae01f
> 0x00000034484f71b0
> 0x0000003fde30e960
> 0x000000156888a430
> 0x0000003eb8560936
> 0x0000003fb121a490
> 0x0000003fb9abddd0
> 0x0000003fe41fc5d0
Indeed, that looks random. The outliers are weird: at first sight I
would say the userspace program is responsible for them, but that
deserves more thought.
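
One quick way to sort the addresses above is to check them against the
virtual address layout. The following is only a sketch, assuming the kernel
uses Sv39 paging (39-bit virtual addresses, so bits 63..38 must be all zeros
for user space or all ones for kernel space); the function names are made up
for this illustration.

```python
from collections import Counter

# Fault addresses copied from the list quoted above.
FAULT_ADDRESSES = [
    0x0000003fe7ef3140, 0x0000003fcd16cff0, 0x0000003fb9e96170,
    0x0000003fd3f4a120, 0x448173f67cdbc8b0, 0x0000003fdfe093f0,
    0x0000003fdfe093f0, 0x0000003fe1d4aa70, 0x0000003fc2cfef90,
    0x0000003fc0f5d050, 0x0000003fe1d879d0, 0x0000003fe9d3e990,
    0xf0ef4585be2ae01f, 0x00000034484f71b0, 0x0000003fde30e960,
    0x000000156888a430, 0x0000003eb8560936, 0x0000003fb121a490,
    0x0000003fb9abddd0, 0x0000003fe41fc5d0,
]

SV39_VA_BITS = 39  # assumption: the kernel runs with Sv39 paging


def classify_sv39(addr):
    """Classify a 64-bit address as user half, kernel half or non-canonical."""
    # Bits 63..38 must all be equal for an address to be valid under Sv39.
    upper = addr >> (SV39_VA_BITS - 1)
    all_ones = (1 << (64 - SV39_VA_BITS + 1)) - 1
    if upper == 0:
        return "user half"
    if upper == all_ones:
        return "kernel half"
    return "non-canonical"


def summarize(addresses):
    """Count how many addresses fall into each class."""
    return Counter(classify_sv39(a) for a in addresses)


if __name__ == "__main__":
    for addr in FAULT_ADDRESSES:
        print(f"{addr:#018x}  {classify_sv39(addr)}")
    print(dict(summarize(FAULT_ADDRESSES)))
```

Under that assumption, 0x448173f67cdbc8b0 and 0xf0ef4585be2ae01f come out as
non-canonical, and every other address in the list falls in the user half.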
>
>> Also print the entire virtual memory
>> mapping at the time of the fault (I don't know how to do that) to check what
>> the address is close to ?
>
> Yes, I'll try to find a way to do that.
>
In case you want to try:
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index ae7b7fe24658..32efe9e750d6 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/perf_event.h>
 #include <linux/signal.h>
 #include <linux/uaccess.h>
+#include <linux/dcache.h>
 
 #include <asm/pgalloc.h>
 #include <asm/ptrace.h>
@@ -166,6 +167,18 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	mmap_read_unlock(mm);
 	/* User mode accesses just cause a SIGSEGV */
 	if (user_mode(regs)) {
+		char filename[512], *f;
+
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			f = filename;
+			if (vma->vm_file)
+				f = dentry_path_raw(vma->vm_file->f_path.dentry, filename, 512);
+			else
+				strcpy(filename, "none");
+			pr_err("%px -> %px %s\n",
+			       (void *)vma->vm_start, (void *)vma->vm_end, f);
+		}
+
 		do_trap(regs, SIGSEGV, code, addr);
 		return;
 	}
which will result in something like this:
# segfault
[ 44.297718] 0000000000010000 -> 0000000000011000 /usr/bin/segfault
[ 44.298067] 0000000000011000 -> 0000000000012000 /usr/bin/segfault
[ 44.298346] 0000000000012000 -> 0000000000013000 /usr/bin/segfault
[ 44.298623] 0000003fc89e8000 -> 0000003fc8b26000 /lib/libc-2.29.so
[ 44.298897] 0000003fc8b26000 -> 0000003fc8b27000 /lib/libc-2.29.so
[ 44.299171] 0000003fc8b27000 -> 0000003fc8b2b000 /lib/libc-2.29.so
[ 44.299444] 0000003fc8b2b000 -> 0000003fc8b2d000 /lib/libc-2.29.so
[ 44.299770] 0000003fc8b2d000 -> 0000003fc8b33000 none
[ 44.300000] 0000003fc8b33000 -> 0000003fc8b34000 none
[ 44.300225] 0000003fc8b34000 -> 0000003fc8b35000 none
[ 44.300454] 0000003fc8b35000 -> 0000003fc8b52000 /lib/ld-linux-riscv64-lp64.so.1
[ 44.300974] 0000003fc8b52000 -> 0000003fc8b53000 /lib/ld-linux-riscv64-lp64.so.1
[ 44.301323] 0000003fc8b53000 -> 0000003fc8b54000 /lib/ld-linux-riscv64-lp64.so.1
[ 44.301708] 0000003fc8b54000 -> 0000003fc8b55000 none
[ 44.301986] 0000003fffbe1000 -> 0000003fffc02000 none
[ 44.302684] segfault[123]: unhandled signal 11 code 0x1 at 0x00000000007b3238 in segfault[10000+1000]
[ 44.304009] CPU: 2 PID: 123 Comm: segfault Tainted: G D 5.8.0-rc4 #23
[ 44.304448] epc: 0000000000010450 ra : 0000003fc8a08250 sp : 0000003fffc01c30
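
To check what a fault address is close to, the dump can be processed offline.
Here is a minimal sketch, assuming the "start -> end name" format printed by
the patch above; parse_vmas() and locate() are hypothetical helper names, and
the sample only contains a few of the lines shown above.

```python
import re

# A few lines copied from the sample output above.
SAMPLE_DUMP = """\
[ 44.297718] 0000000000010000 -> 0000000000011000 /usr/bin/segfault
[ 44.298067] 0000000000011000 -> 0000000000012000 /usr/bin/segfault
[ 44.298346] 0000000000012000 -> 0000000000013000 /usr/bin/segfault
[ 44.298623] 0000003fc89e8000 -> 0000003fc8b26000 /lib/libc-2.29.so
[ 44.301986] 0000003fffbe1000 -> 0000003fffc02000 none
"""

# \s+ before the name also tolerates a mailer wrapping the file name
# onto the next line.
VMA_RE = re.compile(r"([0-9a-f]{16}) -> ([0-9a-f]{16})\s+(\S+)")


def parse_vmas(dump):
    """Return a list of (start, end, name) tuples sorted by start address."""
    return sorted(
        (int(start, 16), int(end, 16), name)
        for start, end, name in VMA_RE.findall(dump)
    )


def locate(addr, vmas):
    """Find the mapping containing addr, plus its nearest neighbours."""
    result = {"inside": None, "below": None, "above": None}
    for vma in vmas:
        start, end, _ = vma
        if start <= addr < end:
            result["inside"] = vma
        elif end <= addr:
            result["below"] = vma  # list is sorted, so the last one is nearest
        elif result["above"] is None:
            result["above"] = vma  # first mapping starting after addr
    return result


if __name__ == "__main__":
    vmas = parse_vmas(SAMPLE_DUMP)
    fault = 0x00000000007B3238  # fault address from the log above
    for key, vma in locate(fault, vmas).items():
        if vma is None:
            print(f"{key}: none")
        else:
            print(f"{key}: {vma[0]:#x}-{vma[1]:#x} {vma[2]}")
```

For the sample above, 0x7b3238 is not inside any mapping: it sits past the
/usr/bin/segfault mapping ending at 0x13000 and far below libc at
0x3fc89e8000.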
>> The 0xd cause implies that the virtual address
>> does not exist at all, which is weird, my guess is that the randomization
>> "reveals" the bug but that the bug is still there once the randomization is
>> disabled.
>
> I also have that feeling. It could even be a userland issue, with the
> userland not able to cope with some memory mapping.
>
> Aurelien
>
Alex