[RFC] stack tracer on arm64

Fri Sep 11 13:26:51 PDT 2015

On Fri, 11 Sep 2015 21:38:40 +0900
AKASHI Takahiro <takahiro.akashi at linaro.org> wrote:

> Junseok's bug report[1] about stack tracer on arm64 has exposed additional
> issues to me. I tried to fix them for the last few weeks, and submitted
> several RFC/patches.  But the problems seem to be more complicated than
> I first imagined.
> 
> So, before taking further steps, I'd like you to have time to read
> the following document and give me your thoughts on it.
> 
> Some of my ideas, particularly a function prologue analyzer, are difficult
> to implement, and may have no guarantee to handle all the cases correctly.
> Some may have impact on code size (and extra instructions at runtime).
> I think that we should find out the balance of accuracy/conveniences of
> this tool (stack tracer) and added complexities/open issues.
> 
> -Takahiro AKASHI
> 
> Table of Contents
> =================
> I.  basic flow of stack tracer
> 
> II. issues on the naive implementation

Yes, the stack tracer is currently a naive implementation. I haven't
had time to make it more robust. Maybe you can do it for me :-)

> 
> III.my approach and open issues
> 
> 
> I.  basic flow of stack tracer
> ------------------------------
> Every C function invokes a ftrace hook routine, _mcount or ftrace_caller,
> in its function prologue.
> 
>     step-1.If the current stack pointer indicates that we have a new record
>            of maximum stack usage, then do the followings.
>     step-2.call arch-specific save_stack_trace() to collect a list of functions
>            in current call chains (we name it a stack dump list here.)
>     step-3.For each function in a stack dump list, compare it against
>            every 4bytes of integer on the stack data.
>     step-4.If matched, we recognize the matched location in the stack as
>            a current stack pointer for this function
>            (for example, idxB for func_B:pcB)
>     step-5.calculate a "stack index" for this function as
>                <top address of stack> - <stack pointer>
>            and record the value into dump_stack_index[].
> 
> LOW     |        |
> rspA    |        |
>          |  ... <----------- static/dynamic local variables in func_A
>          |        |
> rbpA    +--------+   func_A (pcA, rbpA, rspA)
>          |  rbpB  |
>     idxB +--------+
>          |  pcB   |
>          |  ... <----------- extra function args to func_A
> rspB    + - - - -+
>          |        |
>          |  ... <----------- static/dynamic local variables in func_B
>          |        |
> rbpB    +--------+   func_B (pcB, rbpB, rspB)
>          |  rbpC  |
>     idxC +--------+
>          |  pcC   |
>          |  ... <----------- extra function args to func_B
> rspC    + - - - -+
>          |        |
>          |  ... <----------- static/dynamic local variables in func_C
>          |        |
> rbpC    +--------+   func_C (pcC, rbpC, rspC)
>          |        |
>          |  ...   |
> top     + - - - -+
> HIGH
> 	* In the figure above, rbp(frame pointer) and rsp(stack
>            pointer) point to respective addresses at the time after
>            a function prologue has been done but before a callee function
>            is called.
> 
> Later on, "cat /sys/kernel/debug/tracing/stack_trace" shows the latest list
> of stack dump with sizes for each function.
>      - stack size for func_B, for example, is calculated as
>             <stack size> = (top - idxB) - (top - idxC)
> 
> 
> II. issues on the naive implementation
> --------------------------------------
> By "naive implementation," I mean how the mainline kernel without any
> modification behaves with ftrace-based stack tracer.
> (Currently, there is no arch-specific code on arm64 regarding stack tracer.)

One thing I would do is to make a weak function that implements the
naive solution, and allow archs to enhance the code.

> 
> 1) slurping stack
> 
>     Assume that a given function A was invoked, and it was invoked again in
>     another context, then it called another function B which allocated a large
>     length of local variable on the stack, but it has not modified the variable
>     yet.
>     When stack tracer, in step-3, examines the stack looking for B, then A,
>     we may have a chance to accidentally find a staled, not current, stack
>     frame for A because the old frame might reside on the memory for
>     the variable which has not been overwritten.
> 
>      a. The stack_trace output may have staled entries.

Yeah, and implentations of save_stack_trace without FRAME_POINTERS
saves the stack trace with the same issue. That is, what's returned is
just a function found on the stack, which may just be stale data.

> 
> 2) differences between x86 and arm64
> 
>     On x86, "call" instruction automatically pushes a return address on
>     the top of the stack and decrements a stack pointer. Then child function
>     allocates its local variables on the stack.
> 
>     On arm64, a child function decreases reserve memory for local variables,
>     pushes a return address (LR) and old frame pointer then updates a frame
>     pointer to stack pointer.

ug!

> 
>     So step-3/4 seems OK on x86, but on arm64, an estimated stack pointer,
>     idxB in the picture below, is not appropriate for our purpose because it
>     contains child function's "local variables."
>     We should instead use spB, if possible, for better result.
> 
>      a. Stack size for a function in stack_trace output is inaccurate,
>         or rather wrong.
>         (It seems as if <Size> field is one-line offset against <Location>.)
> 
> LOW     |  ...   |
> fpA     +--------+   func_A (pcA, fpA, spA)
>          |  fpB   |
>     idxB + - - - -+
>          |  pcB   |
>          |  ... <----------- static local variables in func_A
>          |  ...   |             and extra function args to func_A
> spB     + - - - -+
>          |  ... <----------- dynamically allocated variables in func_B
> fpB     +--------+   func_B (pcB, fpB, spB)
>          |  fpC   |
>     idxC + - - - -+
>          |  pcC   |
>          |  ... <----------- static local variables in func_B
>          |  ...   |             and extra function args to func_B
> spC     + - - - -+
>          |  ...   |
> fpC     +--------+   func_C (pcC, fpC, spC)
> HIGH    |        |
> 
> 3) Interrupted frame

Note, x86 uses a different stack for interrupts. We don't even trace
stack frames for interrupts on x86. That's because object_is_on_stack()
fails.

>     We miss one or two functions when an interrupt occurs.
>     Assume that func_D calls func_C, func_C calls func_B, func_B is
>     interrupted by a device, el1_irq calls func_A, and so on.
> 
>       a.Since el1_irq (and other exception vectors) doesn't update a frame
>         pointer (x29), the current wind_stackframe() tracks only
>             func_A -> el1_irq -> func_C -> func_D
>         and always misses func_B in a stack dump list.
> 
> LOW     |        |
> fpA,spA +--------+   func_A (pcA, fpA, spA)
>          |  fpB   |
>          |  pcI   |
>          |  ...   |
> spI     +--------+   el1_irq (pcI, fpB, spI)
>          |        |
>          |  ...   |
>          |        |
>          |  fpB   |
>          |  spB   |
>          |  pcB   |
>          |  ...   |
> fpB,spB +--------+   func_B (pcB, fpB, spB)
>          |  fpC   |
>          |  pcC   |
>          |  ...   |
> fpC,spC +--------+   func_C (pcC, fpC, spC)
>          |  fpD   |
>          |  pcD   |
>          |  ...   |
> fpD,spD +--------+   func_D (pcD, fpD, spD)
> HIGH    |        |
> 
>       b.Even worse, if an interrupt occurs in a function prolog of func_B,
>         a stack frame can be uncompleted and a frame pointer may still
>         points to one of func_C (fpC).
>         In this case, we also miss func_C in a stack dump list.
>       c.As a result, stack sizes for functions around el1_irq are wrong.
> 
> 4) functions under function_graph tracer
> 
>     To catch an event of function return, function_graph tracer rewrites
>     LR register saved in a stack frame to a hook function, return_to_handler.
>     An original value is saved in current->ret_stack[] on entry and restored
>     later on exit by ftrace_return_to_handler() called in return_to_handler.
> 
>      a. With function_graph tracer on, unwind_stackframe() will return lots of
>         bogus entries, "return_to_handler-0x4."
> 
>     Apart from stack tracer, fixing this issue might make more sense, say,
>     for panic cases because panic_stack_dump() also uses unwind_frame().

I did a git grep for panic_stack_dump and it came up with zipo. Is that
something new?

Note, in x86, the stack dump adds in the real functions, along with the
return_to_handler. Hmm, I haven't really looked too see how that
affects stack tracing on x86.

> 
> 5) leaf function
> 
>     gcc (on arm64) will, by default, optimize a leaf function by removing
>     a stack frame. This causes several issues:
> 
>      a.we always miss a case of maximum stack size whenever a leaf function
>        is at the top of call chains because a stack tracer code will never
>        be executed against it.
>      b.When an interrupt occurs during a leaf function (func_B), we miss its
>        parent function (func_C) in a stack dump list because func_B has no
>        stack frame and elr_el1 still points to func_C's frame.
>        (This is a variant case as described in "interrupted frame" section.)
> 
> LOW     |        |
> fpA,spA +--------+   func_A (pcA, fpA, spA)
>          |  fpB   |
>          |  pcI   |
>          |  ...   |
> spI     +--------+   el1_irq (pcI, fpC, spI)
>          (pt_regs)|
>          |  ...   |
>          |  fpC   |
>          |  spB   |
>          |  pcB   |
> spB     +--------+   func_B (pcB, fpC, spB)
>          |  ...   |
> fpC,spC +--------+   func_C (pcC, fpC, spC)
> HIGH    |        |
> 

I think on x86, when you enable -pg on gcc, it basically makes all leaf
functions act like normal functions. I take it, that's not the same on
arm64.

> 6) KVM
>     (T.B.D.)

thank god.

> 
> 
> III.my approach and open issues
> -------------------------------
> 
> 1) slurping stack, and
> 2) differences between x86 and arm64
> 
>     Instead of slurping stack data like at step-3/4, we extend save_stack_trace()
>     /wind_stackframe() to collect a list of stack pointers for all call chains.
>     Since arm64 ABI does not save a stack pointer in a stack frame, we try to
>     calculate that value from callee's stack frame.
> 
>     For example, in the following case, a value of stack point when func_B
>     calls func_A is calculated as:
> 
>          spB = fpA + (XX -- YY)
> 
>     where XX and YY are determined by a "function prologue analyzer" described
>     in "interrupted frame" section.
> 
> LOW     |  ...   |
> fpA     +--------+   func_A (pcA, fpA, spA)
>          |  fpB   |
>          |  pcB   |
>          |  ... <----------- static local variables in func_A
>          |  ...   |             and extra function args to func_A
> spB     + - - - -+
>          |  ... <----------- dynamically allocated variables in func_B
> fpB     +--------+   func_B (pcB, fpB, spB)
>          |  fpC   |
>          |  pcC   |
>          |  ... <----------- static local variables in func_B
>          |  ...   |             and extra function args to func_B
> spC     + - - - -+
>          |  ...   |
> fpC     +--------+   func_C (pcC, fpC, spC)
> HIGH    |        |
> 
>          open issue-0: This fix also depends on function prologue analyzer
>                   being available.
> 
> 3) interrupted frame
> 
>     First of all, we should update a frame pointer even in exception vector
>     code so that unwind_frame() can dereference stack frames.
>     Since exception handlers place data of 'struct pt_regs' instead of
>     'struct stackframe' in *normal* functions, we will take special
>     care of this case.
> 
> 	open issue-1: should we use the same form of stack frame here?

Not sure I understand the question.

> 
>     This way, we will see an additional entry, func_B in the figure above,
>     in a stack dump list but we still have a chance to miss func_C.
> 
>     My idea here is to analyze a function prologue and determine where
>     an interrupt is actually taken, then to re-build a stack frame if necessary.
> 
>          open issue-2: it seems to be very difficult to implement such a parser
>                   for all the possible patterns of function prologue. This is
>                   partly because gcc doesn't use a fixed template to generate
>                   a function prologue code, and partly because "instruction
>                   scheduling" may mess up a function prologue and a body.
> 
>     According to comments from some toolchain guy, using libunwind with
>     compiler-generated "unwind tables" information would be most promising
>     for implementing an function prologue analyzer.
>     It seems, however, that any previous attempts to merge libunwind into
>     the kernel has been abandoned[2].

Yes, please don't.

> 
>     So, in the followings, I try to describe an outline of my own prototype
>     (simple) analyzer and how it is utilized in order to find out func_C.
> 
>     My analyzer recognizes two patterns of function prologue, (A) and (B),
>     which I think cover most cases, and identifies a location of interrupt,
>     1, 2, 3 and 0 (a normal case).
> 
>     function prologues
>     ------------------
>     (A)
>           sub sp, sp, #XX
>           stp x29, x30, [sp, #YY]
>           add x29, sp, #YY
>           ...
>     (B)
>           stp x29, x30, [sp, #-XX]
>           mov x29, sp
>           ...
> 
>     Due to gcc's instruction scheduling, we may see other instructions
>     between those assembler lines of function prologue.
> 
>     location of interrupt
>     ---------------------
>           (A)                     (B)
>     loc-1
>   	 sub sp, sp, #XX         stp, x29, x30, [sp, #-XX]
>     loc-2
>           stp x29, x30, [sp, #YY]
>     loc-3
>           add x29, sp, #YY        mov x29, sp
>     loc-0
>           <function body>         <function body>
> 
>     Here,
>            loc-1: SP not updated, FP/LR not saved, FP not updated
>            loc-2: SP updated, FP/LR not saved, FP not updated
>            loc-3: SP updated, FP/LR saved, FP not updated
>            loc-0: SP updated, FP/LR saved, FP updated
> 
>     Re-build a stack frame for func_B and func_C
>     --------------------------------------------
>     The following figures illustrate how we can re-build a stack frame for
>     func_B and func_C based on available data on the stack.
> 
>     (loc-0)
> 
> LOW     |        |
> fpA,spA +--------+   func_A (pcA, fpA, spA)
>          |  fpI   |
>          |  pcI   |
>          |  ...   |
> fpI,spI +--------+   el1_irq (pcI, fpI, spI)
>          |        |
>          |  ...   |
>          |        |
>          |  fpB   |
>          |  spB   |
>          |  pcB   |
>          |  ...   |
> fpB,spB +--------+   func_B (pcB, fpB, spB)
>          |  fpC   |
>          |  pcC   |
>          |  ...   |
> fpC,spC +--------+   func_C (pcC, fpC, spC)
>          |  fpD   |
>          |  pcD   |
>          |  ...   |
> fpD,spD +--------+   func_D (pcD, fpD, spD)
> HIGH    |        |
> 
>     => calculate
>               frame B: sp: *(spI + S_SP)
>                        fp: *(spI + S_FP)
>                        pc: *(spI + S_PC)
> 
>     (loc-3)
> 
> LOW     |        |
> fpA,spA +--------+   func_A (pcA, fpA, spA)
>          |  fpI   |
>          |  pcI   |
>          |  ...   |
> fpI,spI +--------+   el1_irq (pcI, fpI, spI)
>          |        |
>          |  ...   |
>          |        |
>          |  fpC   |
>          |  spB   |
>          |  pcB   |
>          |  ...   |
> spB     +--------+   func_B (pcB, fpC, spB)
>          |  fpC   |
>          |  pcC   |
>          |  ...   |
> fpC,spC +--------+   func_C (pcC, fpC, spC)
>          |  fpD   |
>          |  pcD   |
>          |  ...   |
> fpD,spD +--------+   func_D (pcD, fpD, spD)
> HIGH    |        |
> 
>     => calculate
>               frame B: sp: *(spI + F_SP)
>                        fp: *(spI + F_FP)
>                        pc: *(spI + F_PC)
>               frame C: sp: spB + XX
>                        fp: fpB'
>                        pc: *(spB + YY + 0x8)
> 
>     (loc-2)
> 
> LOW     |        |
> fpA,spA +--------+   func_A (pcA, fpA, spA)
>          |  fpI   |
>          |  pcI   |
>          |  ...   |
> fpI,spI +--------+   el1_irq (pcI, fpI, spI)
>          |        |
>          |  ...   |
>          |        |
>          |  fpC   |
>          |  spB   |
>          |  pcB   |
>          |  ...   |
> spB     +--------+   func_B (pcB, fpC, spB)
>          |   x    |
>          |   x    |
>          |  ...   |
> fpC,spC +--------+   func_C (pcC, fpC, spC)
>          |  fpD   |
>          |  pcD   |
>          |  ...   |
> fpD,spD +--------+   func_D (pcD, fpD, spD)
> HIGH    |        |
> 
>     => calculate
>               frame B: sp: *(spI + F_SP)
>                        fp: *(spI + F_FP)
>                        pc: *(spI + F_PC)
>               frame C: sp: spB + XX
>                        fp: fpB'
>                        pc: "branch dest at pcD"
> 
>     (loc-1)
> 
> LOW     |        |
> fpA,spA +--------+   func_A (pcA, fpA, spA)
>          |  fpI   |
>          |  pcI   |
>          |  ...   |
> fpI,spI +--------+   el1_irq (pcI, fpI, spI)
>          |        |
>          |  ...   |
>          |        |
>          |  fpC   |
>          |  spC   |
>          |  pcB   |
>          |  ...   |   func_B (pcB, fpC, spC)
> fpC,spC +--------+   func_C (pcC, fpC, spC)
>          |  fpD   |
>          |  pcD   |
>          |  ...   |
> fpD,spD +--------+   func_D (pcD, fpD, spD)
> HIGH    |        |
> 
>     => calculate
>               frame B: sp: *(spI + F_SP)
>                        fp: *(spI + F_FP)
>                        pc: *(spI + F_PC)
>               frame C: sp: spB'
>                        fp: fpB'
>                        pc: "branch dest at pcD"
> 
>          In the case of loc-1 or loc-2,
>          open issue-3: PC derived from parent's bl instruction (pcC) is
>                   an entry address of the function, not the exact place
>                   when an interrupt is taken.
>          open issue-4: If parent's bl instruction is "blr", we have no way
>                   to identify a branch destination address.
>          open issue-5: Each tack pointer is determined from child to parent,
>                   so PC for func_C cannot be fixed within wind_stackframe().

I say, go for it ;-)

> 
> 4) functions under function_graph tracer
> 
>     Unwind_frame() tries to replace "frame->pc" with an original value by
>     looking it up in current->ret_stack[].
> 
>     Unfortunately, saving a value to this array, incrementing an array index,
>     current->curr_ret_index, and updating memory on the stack are not done
>     atomically because ftrace does not disable interrupts.
>     So if an interrupt occurs and succeeding functions are also traced,
>     recovering a stack dump list (stack_trace->entries) from ret_stack[]
>     may result in a wrong list being created.

If we have a per arch implementation, we could check if we are in an
interrupt, and do some magic handling to see if the update happened in
the race location. Not sure what to do if it did. But maybe that could
work too.

> 
>          open issue-6: wrong replacement
>                   Assume that func_C calls func_B, func_B is interrupted, and
>                   el1_irq calls func_A. Depending on when an interrupt is taken
>                   during func_B, wind_stackframe() may return one of following
>                   stack dump lists:
>                   (please note that ftrace related functions are *not* traced.)
> 
>                  (A) interrupt during a function prologue
>                          ...
>                          return_to_handler(for A)
>                          el1_irq
>                          ...
>                          ftrace_push_return_trace
>                          prepare_ftrace_return
>                          ftrace_caller
>                          func_B
>                          return_to_handler(for C)
> 
>                    we have pushed func_B in ret_stack[], but LR in a stack frame
>                    has not yet been rewritten to return_to_handler.
>                    since ret_stack[] has func_A, func_B and func_C, we replace
>                          return_to_handler(for A) to func_A
>                          return_to_handler(for C) to func_B
>                    and so on.
> 
>                  (B) interrupt during a return hook
>                          ...
>                          return_to_handler(for A)
>                          el1_irq
>                          ...
>                          ftrace_pop_return_trace
>                          ftrace_return_to_handler
>                          return_to_handler(for B)
>                          return_to_handler(for C)
> 
>                    we may already have popped func_B from ret_stack[], and we
>                    replace
>                          return_to_handler(for A) to func_A
>                          return_to_handler(for B) to func_C
>                    and so on.
> 
>     In fact, there are more variant cases as pushing/popping against ret_stack[]
>     is not an atomic operation and so data in this array and its index,
>     current->ret_stack_index, may get inconsistent.

Well, if we do detect that we are in the race window, we can see the
number of return_to_handler's that are in the stack, compared to the
ret_stack[] array. And we should also know where the interrupt started,
and know where the shift occurred.

> 
> 5) leaf function
> 
>     Adding arch-specific gcc option, -mno-omit-leaf-frame-pointer, to
>     KBUILD_CFLAGS will assure that all the C functions have its own stack
>     frame (after its function prologue).
>     Then, issue 1-b will fall into issue-2.
> 
>          open issue-7: This fix will increase the code size of the kernel and
>                   add extra instructions at runtime.

Which may not be acceptable. Perhaps allow those that want accurate
stack tracing to add that config, but don't make it default if
STACK_TRACER is enabled.

-- Steve

> 
> 
> ---
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
> [2] https://lkml.org/lkml/2012/2/10/129