[PATCH] arm: port KCOV to arm

Fri Apr 27 06:06:43 PDT 2018

On Thu, Apr 26, 2018 at 05:04:09PM +0200, Dmitry Vyukov wrote:
> On Thu, Apr 26, 2018 at 4:58 PM, Dmitry Vyukov <dvyukov at google.com> wrote:
> >>> > On Thu, Apr 26, 2018 at 03:08:46PM +0200, Dmitry Vyukov wrote:

> >>> >> +# Instrumenting fault.c causes infinite recursion between:
> >>> >> +# __dabt_svc -> do_DataAbort -> __sanitizer_cov_trace_pc -> __dabt_svc
> >>> >> +KCOV_INSTRUMENT_fault.o := n
> >>> >
> >>> > Why does __sanitizer_cov_trace_pc() cause a data abort?
> >>> >
> >>> > We don't seem to have this issue on arm64, where our fault handling is
> >>> > instrumented, so this seems suspect.
> >>>
> >>> I don't have an explanation. That's just what me and Takuo observed.
> >>> We've seen that it happens when __sanitizer_cov_trace_pc tries to
> >>> dereference current to check kcov mode.
> >>
> >> Huh. The only reason I can imagine that might happen is if the
> >> compiler's generating a misaligned access requiring fixup. If your
> >> compiler's doing that, it could presumably do that in the fault handling
> >> code too, which would be a big problem.
> >>
> >> If you happen to have a binary around, can you dump the disassembly for
> >> your __sanitizer_cov_trace_pc?
> >>
> >> Using the Linaro 17.05 arm-linux-gnueabihf-gcc 6.3 toolchain I get the
> >> following:
> >>
> >> 00000000 <__sanitizer_cov_trace_pc>:
> >>    0:   e52de004        push    {lr}            ; (str lr, [sp, #-4]!)
> >>    4:   e1a0300d        mov     r3, sp
> >>    8:   e3c33d7f        bic     r3, r3, #8128   ; 0x1fc0
> >>    c:   e3a02c01        mov     r2, #256        ; 0x100
> >>   10:   e3c3303f        bic     r3, r3, #63     ; 0x3f
> >>   14:   e340201f        movt    r2, #31
> >>   18:   e5931004        ldr     r1, [r3, #4]
> >>   1c:   e1110002        tst     r1, r2
> >>   20:   149df004        popne   {pc}            ; (ldrne pc, [sp], #4)
> >>   24:   e593300c        ldr     r3, [r3, #12]
> >>   28:   e5932508        ldr     r2, [r3, #1288] ; 0x508
> >>   2c:   e3520002        cmp     r2, #2
> >>   30:   149df004        popne   {pc}            ; (ldrne pc, [sp], #4)
> >>   34:   e5932510        ldr     r2, [r3, #1296] ; 0x510
> >>   38:   e593150c        ldr     r1, [r3, #1292] ; 0x50c
> >>   3c:   e5923000        ldr     r3, [r2]
> >>   40:   e2833001        add     r3, r3, #1
> >>   44:   e1530001        cmp     r3, r1
> >>   48:   3782e103        strcc   lr, [r2, r3, lsl #2]
> >>   4c:   35823000        strcc   r3, [r2]
> >>   50:   e49df004        pop     {pc}            ; (ldr pc, [sp], #4)
> >>
> >> ... which looks sane/safe to me.
> >
> > Here is my disasm:
> >
> > 801dc1b0 <__sanitizer_cov_trace_pc>:
> > 801dc1b0:       e52de004        push    {lr}            ; (str lr, [sp, #-4]!)
> > 801dc1b4:       e1a0300d        mov     r3, sp
> > 801dc1b8:       e3c33d7f        bic     r3, r3, #8128   ; 0x1fc0
> > 801dc1bc:       e3a02c01        mov     r2, #256        ; 0x100
> > 801dc1c0:       e3c3303f        bic     r3, r3, #63     ; 0x3f
> > 801dc1c4:       e340201f        movt    r2, #31
> > 801dc1c8:       e5931004        ldr     r1, [r3, #4]
> > 801dc1cc:       e1110002        tst     r1, r2
> > 801dc1d0:       149df004        popne   {pc}            ; (ldrne pc, [sp], #4)
> > 801dc1d4:       e593300c        ldr     r3, [r3, #12]
> > 801dc1d8:       e5932be0        ldr     r2, [r3, #3040] ; 0xbe0
> > 801dc1dc:       e3520002        cmp     r2, #2
> > 801dc1e0:       149df004        popne   {pc}            ; (ldrne pc, [sp], #4)
> > 801dc1e4:       e5932be8        ldr     r2, [r3, #3048] ; 0xbe8
> > 801dc1e8:       e5931be4        ldr     r1, [r3, #3044] ; 0xbe4

These offsets for task_struct::{kcov_area,kcov_size} are *much* larger
than mine. Can you share your kernel config?

> > 801dc1ec:       e5923000        ldr     r3, [r2]
> > 801dc1f0:       e2833001        add     r3, r3, #1
> > 801dc1f4:       e1510003        cmp     r1, r3
> > 801dc1f8:       8782e103        strhi   lr, [r2, r3, lsl #2]
> > 801dc1fc:       85823000        strhi   r3, [r2]
> > 801dc200:       e49df004        pop     {pc}            ; (ldr pc, [sp], #4)
> >
> > Compiler is gcc version 7.2.0 (Debian 7.2.0-7).

I also tried with the Linaro 17.11 GCC 7.2.1, and see identical codegen
to yours above, modulo the task_struct offsets.

> > I've now rebuilt without that change and will hopefully soon get
> > crashes to reconfirm.

Just to check, do you see this when starting userspace? i.e. without
opening any kcov files?

I can't reproduce the issue on real hardware on top of v4.17-rc2, when
booting and running a standard ARMv7 buildroot userspace. So the kcov
mode check seems fine to me.
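
For reference, here is roughly how I read the disassembly above, written
as a userspace C model. This is only a sketch: the model_* names and the
struct layout are made up for illustration, and the constants are taken
straight from the immediates in the dump rather than from the kernel
headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; values are read off the disassembly. */
enum { MODEL_KCOV_MODE_TRACE_PC = 2 };     /* cmp r2, #2 */
#define MODEL_THREAD_SIZE    0x2000u       /* bic #0x1fc0 + bic #0x3f */
#define MODEL_NOT_TASK_MASK  0x001f0100u   /* mov #0x100 + movt #31 */

/* Illustrative layout only; the real fields live in task_struct. */
struct model_task {
	unsigned int   kcov_mode;
	unsigned long  kcov_size;
	unsigned long *kcov_area;
};

/* The two bic instructions: round sp down to the base of its stack,
 * which is where thread_info is found. */
static uintptr_t model_stack_base(uintptr_t sp)
{
	return sp & ~(uintptr_t)(MODEL_THREAD_SIZE - 1);
}

/* Returns 1 if ip was recorded, 0 if any check bailed out. */
static int model_trace_pc(struct model_task *t,
			  unsigned int preempt_count,
			  unsigned long ip)
{
	unsigned long pos;

	/* tst r1, r2; popne: bail out unless in task context. */
	if (preempt_count & MODEL_NOT_TASK_MASK)
		return 0;

	/* The kcov mode check. */
	if (t->kcov_mode != MODEL_KCOV_MODE_TRACE_PC)
		return 0;

	/* area[0] counts the PCs stored after it. */
	pos = t->kcov_area[0] + 1;
	if (pos >= t->kcov_size)
		return 0;

	t->kcov_area[pos] = ip;   /* strcc lr, [r2, r3, lsl #2] */
	t->kcov_area[0] = pos;    /* strcc r3, [r2] */
	return 1;
}
```

Every load in there is word-sized and goes through the stack base, the
task pointer, or the kcov area, so none of them should need an
alignment fixup.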

> Yes, a swarm of assorted crashes now. Here are 4:
> 
> buildroot login: Unable to handle kernel paging request at virtual
> address c9db963e
> pgd = c188b8a2
> [c9db963e] *pgd=00000000
> Internal error: Oops: 80000005 [#1] SMP ARM
> Modules linked in:
> CPU: 0 PID: 933 Comm: syz-executor3 Not tainted 4.17.0-rc2+ #4
> Hardware name: ARM-Versatile Express
> PC is at 0xc9db963e

That PC is the faulting address, which doesn't look like a valid kernel
image address given it's ~1G above the valid LR value down at
0x8010e290.
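
The "~1G" figure is just the difference between the two addresses. A
trivial sketch of that arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Distance in bytes between the bogus PC and the known-good LR. */
static uint32_t model_gap_bytes(uint32_t pc, uint32_t lr)
{
	return pc - lr;
}

/* The same distance in whole MiB. */
static uint32_t model_gap_mib(uint32_t pc, uint32_t lr)
{
	return (pc - lr) >> 20;
}
```

For pc = 0xc9db963e and lr = 0x8010e290 this gives 0x49cab3ae bytes,
i.e. 1180 MiB.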

> LR is at do_work_pending+0xcc/0xf0

Assuming your GCC's codegen is the same as mine, that's the LR set up by
the call to task_work_run(), immediately before we branch back to the
start of the loop. So either we blew up in task_work_run(), or we've
returned to the top of the loop.

At the top of the loop my GCC has a bl to __sanitizer_cov_trace_pc(),
which should setup the LR.

My task_work_run() doesn't tail-call to anything, so I don't currently
see how we could end up in this state. That could be down to text
corruption, or corruption of the state of an interrupted context.

If you don't already have STRICT_KERNEL_RWX enabled, could you try
turning it on?

Thanks,
Mark.