[RFC] stack tracer on arm64

Fri Sep 11 05:38:40 PDT 2015

Junseok's bug report[1] about stack tracer on arm64 has exposed additional
issues to me. I tried to fix them for the last few weeks, and submitted
several RFC/patches.  But the problems seem to be more complicated than
I first imagined.

So, before taking further steps, I'd like you to have time to read
the following document and give me your thoughts on it.

Some of my ideas, particularly a function prologue analyzer, are difficult
to implement, and may have no guarantee to handle all the cases correctly.
Some may have impact on code size (and extra instructions at runtime).
I think that we should find out the balance of accuracy/conveniences of
this tool (stack tracer) and added complexities/open issues.

-Takahiro AKASHI

Table of Contents
=================
I.  basic flow of stack tracer

II. issues on the naive implementation

III.my approach and open issues

I.  basic flow of stack tracer
------------------------------
Every C function invokes a ftrace hook routine, _mcount or ftrace_caller,
in its function prologue.

    step-1.If the current stack pointer indicates that we have a new record
           of maximum stack usage, then do the followings.
    step-2.call arch-specific save_stack_trace() to collect a list of functions
           in current call chains (we name it a stack dump list here.)
    step-3.For each function in a stack dump list, compare it against
           every 4bytes of integer on the stack data.
    step-4.If matched, we recognize the matched location in the stack as
           a current stack pointer for this function
           (for example, idxB for func_B:pcB)
    step-5.calculate a "stack index" for this function as
               <top address of stack> - <stack pointer>
           and record the value into dump_stack_index[].

LOW     |        |
rspA    |        |
         |  ... <----------- static/dynamic local variables in func_A
         |        |
rbpA    +--------+   func_A (pcA, rbpA, rspA)
         |  rbpB  |
    idxB +--------+
         |  pcB   |
         |  ... <----------- extra function args to func_A
rspB    + - - - -+
         |        |
         |  ... <----------- static/dynamic local variables in func_B
         |        |
rbpB    +--------+   func_B (pcB, rbpB, rspB)
         |  rbpC  |
    idxC +--------+
         |  pcC   |
         |  ... <----------- extra function args to func_B
rspC    + - - - -+
         |        |
         |  ... <----------- static/dynamic local variables in func_C
         |        |
rbpC    +--------+   func_C (pcC, rbpC, rspC)
         |        |
         |  ...   |
top     + - - - -+
HIGH
	* In the figure above, rbp(frame pointer) and rsp(stack
           pointer) point to respective addresses at the time after
           a function prologue has been done but before a callee function
           is called.

Later on, "cat /sys/kernel/debug/tracing/stack_trace" shows the latest list
of stack dump with sizes for each function.
     - stack size for func_B, for example, is calculated as
            <stack size> = (top - idxB) - (top - idxC)

II. issues on the naive implementation
--------------------------------------
By "naive implementation," I mean how the mainline kernel without any
modification behaves with ftrace-based stack tracer.
(Currently, there is no arch-specific code on arm64 regarding stack tracer.)

1) slurping stack

    Assume that a given function A was invoked, and it was invoked again in
    another context, then it called another function B which allocated a large
    length of local variable on the stack, but it has not modified the variable
    yet.
    When stack tracer, in step-3, examines the stack looking for B, then A,
    we may have a chance to accidentally find a staled, not current, stack
    frame for A because the old frame might reside on the memory for
    the variable which has not been overwritten.

     a. The stack_trace output may have staled entries.

2) differences between x86 and arm64

    On x86, "call" instruction automatically pushes a return address on
    the top of the stack and decrements a stack pointer. Then child function
    allocates its local variables on the stack.

    On arm64, a child function decreases reserve memory for local variables,
    pushes a return address (LR) and old frame pointer then updates a frame
    pointer to stack pointer.

    So step-3/4 seems OK on x86, but on arm64, an estimated stack pointer,
    idxB in the picture below, is not appropriate for our purpose because it
    contains child function's "local variables."
    We should instead use spB, if possible, for better result.

     a. Stack size for a function in stack_trace output is inaccurate,
        or rather wrong.
        (It seems as if <Size> field is one-line offset against <Location>.)

LOW     |  ...   |
fpA     +--------+   func_A (pcA, fpA, spA)
         |  fpB   |
    idxB + - - - -+
         |  pcB   |
         |  ... <----------- static local variables in func_A
         |  ...   |             and extra function args to func_A
spB     + - - - -+
         |  ... <----------- dynamically allocated variables in func_B
fpB     +--------+   func_B (pcB, fpB, spB)
         |  fpC   |
    idxC + - - - -+
         |  pcC   |
         |  ... <----------- static local variables in func_B
         |  ...   |             and extra function args to func_B
spC     + - - - -+
         |  ...   |
fpC     +--------+   func_C (pcC, fpC, spC)
HIGH    |        |

3) Interrupted frame
    We miss one or two functions when an interrupt occurs.
    Assume that func_D calls func_C, func_C calls func_B, func_B is
    interrupted by a device, el1_irq calls func_A, and so on.

      a.Since el1_irq (and other exception vectors) doesn't update a frame
        pointer (x29), the current wind_stackframe() tracks only
            func_A -> el1_irq -> func_C -> func_D
        and always misses func_B in a stack dump list.

LOW     |        |
fpA,spA +--------+   func_A (pcA, fpA, spA)
         |  fpB   |
         |  pcI   |
         |  ...   |
spI     +--------+   el1_irq (pcI, fpB, spI)
         |        |
         |  ...   |
         |        |
         |  fpB   |
         |  spB   |
         |  pcB   |
         |  ...   |
fpB,spB +--------+   func_B (pcB, fpB, spB)
         |  fpC   |
         |  pcC   |
         |  ...   |
fpC,spC +--------+   func_C (pcC, fpC, spC)
         |  fpD   |
         |  pcD   |
         |  ...   |
fpD,spD +--------+   func_D (pcD, fpD, spD)
HIGH    |        |

      b.Even worse, if an interrupt occurs in a function prolog of func_B,
        a stack frame can be uncompleted and a frame pointer may still
        points to one of func_C (fpC).
        In this case, we also miss func_C in a stack dump list.
      c.As a result, stack sizes for functions around el1_irq are wrong.

4) functions under function_graph tracer

    To catch an event of function return, function_graph tracer rewrites
    LR register saved in a stack frame to a hook function, return_to_handler.
    An original value is saved in current->ret_stack[] on entry and restored
    later on exit by ftrace_return_to_handler() called in return_to_handler.

     a. With function_graph tracer on, unwind_stackframe() will return lots of
        bogus entries, "return_to_handler-0x4."

    Apart from stack tracer, fixing this issue might make more sense, say,
    for panic cases because panic_stack_dump() also uses unwind_frame().

5) leaf function

    gcc (on arm64) will, by default, optimize a leaf function by removing
    a stack frame. This causes several issues:

     a.we always miss a case of maximum stack size whenever a leaf function
       is at the top of call chains because a stack tracer code will never
       be executed against it.
     b.When an interrupt occurs during a leaf function (func_B), we miss its
       parent function (func_C) in a stack dump list because func_B has no
       stack frame and elr_el1 still points to func_C's frame.
       (This is a variant case as described in "interrupted frame" section.)

LOW     |        |
fpA,spA +--------+   func_A (pcA, fpA, spA)
         |  fpB   |
         |  pcI   |
         |  ...   |
spI     +--------+   el1_irq (pcI, fpC, spI)
         (pt_regs)|
         |  ...   |
         |  fpC   |
         |  spB   |
         |  pcB   |
spB     +--------+   func_B (pcB, fpC, spB)
         |  ...   |
fpC,spC +--------+   func_C (pcC, fpC, spC)
HIGH    |        |

6) KVM
    (T.B.D.)

III.my approach and open issues
-------------------------------

1) slurping stack, and
2) differences between x86 and arm64

    Instead of slurping stack data like at step-3/4, we extend save_stack_trace()
    /wind_stackframe() to collect a list of stack pointers for all call chains.
    Since arm64 ABI does not save a stack pointer in a stack frame, we try to
    calculate that value from callee's stack frame.

    For example, in the following case, a value of stack point when func_B
    calls func_A is calculated as:

         spB = fpA + (XX -- YY)

    where XX and YY are determined by a "function prologue analyzer" described
    in "interrupted frame" section.

LOW     |  ...   |
fpA     +--------+   func_A (pcA, fpA, spA)
         |  fpB   |
         |  pcB   |
         |  ... <----------- static local variables in func_A
         |  ...   |             and extra function args to func_A
spB     + - - - -+
         |  ... <----------- dynamically allocated variables in func_B
fpB     +--------+   func_B (pcB, fpB, spB)
         |  fpC   |
         |  pcC   |
         |  ... <----------- static local variables in func_B
         |  ...   |             and extra function args to func_B
spC     + - - - -+
         |  ...   |
fpC     +--------+   func_C (pcC, fpC, spC)
HIGH    |        |

         open issue-0: This fix also depends on function prologue analyzer
                  being available.

3) interrupted frame

    First of all, we should update a frame pointer even in exception vector
    code so that unwind_frame() can dereference stack frames.
    Since exception handlers place data of 'struct pt_regs' instead of
    'struct stackframe' in *normal* functions, we will take special
    care of this case.

	open issue-1: should we use the same form of stack frame here?

    This way, we will see an additional entry, func_B in the figure above,
    in a stack dump list but we still have a chance to miss func_C.

    My idea here is to analyze a function prologue and determine where
    an interrupt is actually taken, then to re-build a stack frame if necessary.

         open issue-2: it seems to be very difficult to implement such a parser
                  for all the possible patterns of function prologue. This is
                  partly because gcc doesn't use a fixed template to generate
                  a function prologue code, and partly because "instruction
                  scheduling" may mess up a function prologue and a body.

    According to comments from some toolchain guy, using libunwind with
    compiler-generated "unwind tables" information would be most promising
    for implementing an function prologue analyzer.
    It seems, however, that any previous attempts to merge libunwind into
    the kernel has been abandoned[2].

    So, in the followings, I try to describe an outline of my own prototype
    (simple) analyzer and how it is utilized in order to find out func_C.

    My analyzer recognizes two patterns of function prologue, (A) and (B),
    which I think cover most cases, and identifies a location of interrupt,
    1, 2, 3 and 0 (a normal case).

    function prologues
    ------------------
    (A)
          sub sp, sp, #XX
          stp x29, x30, [sp, #YY]
          add x29, sp, #YY
          ...
    (B)
          stp x29, x30, [sp, #-XX]
          mov x29, sp
          ...

    Due to gcc's instruction scheduling, we may see other instructions
    between those assembler lines of function prologue.

    location of interrupt
    ---------------------
          (A)                     (B)
    loc-1
  	 sub sp, sp, #XX         stp, x29, x30, [sp, #-XX]
    loc-2
          stp x29, x30, [sp, #YY]
    loc-3
          add x29, sp, #YY        mov x29, sp
    loc-0
          <function body>         <function body>

    Here,
           loc-1: SP not updated, FP/LR not saved, FP not updated
           loc-2: SP updated, FP/LR not saved, FP not updated
           loc-3: SP updated, FP/LR saved, FP not updated
           loc-0: SP updated, FP/LR saved, FP updated

    Re-build a stack frame for func_B and func_C
    --------------------------------------------
    The following figures illustrate how we can re-build a stack frame for
    func_B and func_C based on available data on the stack.

    (loc-0)

LOW     |        |
fpA,spA +--------+   func_A (pcA, fpA, spA)
         |  fpI   |
         |  pcI   |
         |  ...   |
fpI,spI +--------+   el1_irq (pcI, fpI, spI)
         |        |
         |  ...   |
         |        |
         |  fpB   |
         |  spB   |
         |  pcB   |
         |  ...   |
fpB,spB +--------+   func_B (pcB, fpB, spB)
         |  fpC   |
         |  pcC   |
         |  ...   |
fpC,spC +--------+   func_C (pcC, fpC, spC)
         |  fpD   |
         |  pcD   |
         |  ...   |
fpD,spD +--------+   func_D (pcD, fpD, spD)
HIGH    |        |

    => calculate
              frame B: sp: *(spI + S_SP)
                       fp: *(spI + S_FP)
                       pc: *(spI + S_PC)

    (loc-3)

LOW     |        |
fpA,spA +--------+   func_A (pcA, fpA, spA)
         |  fpI   |
         |  pcI   |
         |  ...   |
fpI,spI +--------+   el1_irq (pcI, fpI, spI)
         |        |
         |  ...   |
         |        |
         |  fpC   |
         |  spB   |
         |  pcB   |
         |  ...   |
spB     +--------+   func_B (pcB, fpC, spB)
         |  fpC   |
         |  pcC   |
         |  ...   |
fpC,spC +--------+   func_C (pcC, fpC, spC)
         |  fpD   |
         |  pcD   |
         |  ...   |
fpD,spD +--------+   func_D (pcD, fpD, spD)
HIGH    |        |

    => calculate
              frame B: sp: *(spI + F_SP)
                       fp: *(spI + F_FP)
                       pc: *(spI + F_PC)
              frame C: sp: spB + XX
                       fp: fpB'
                       pc: *(spB + YY + 0x8)

    (loc-2)

LOW     |        |
fpA,spA +--------+   func_A (pcA, fpA, spA)
         |  fpI   |
         |  pcI   |
         |  ...   |
fpI,spI +--------+   el1_irq (pcI, fpI, spI)
         |        |
         |  ...   |
         |        |
         |  fpC   |
         |  spB   |
         |  pcB   |
         |  ...   |
spB     +--------+   func_B (pcB, fpC, spB)
         |   x    |
         |   x    |
         |  ...   |
fpC,spC +--------+   func_C (pcC, fpC, spC)
         |  fpD   |
         |  pcD   |
         |  ...   |
fpD,spD +--------+   func_D (pcD, fpD, spD)
HIGH    |        |

    => calculate
              frame B: sp: *(spI + F_SP)
                       fp: *(spI + F_FP)
                       pc: *(spI + F_PC)
              frame C: sp: spB + XX
                       fp: fpB'
                       pc: "branch dest at pcD"

    (loc-1)

LOW     |        |
fpA,spA +--------+   func_A (pcA, fpA, spA)
         |  fpI   |
         |  pcI   |
         |  ...   |
fpI,spI +--------+   el1_irq (pcI, fpI, spI)
         |        |
         |  ...   |
         |        |
         |  fpC   |
         |  spC   |
         |  pcB   |
         |  ...   |   func_B (pcB, fpC, spC)
fpC,spC +--------+   func_C (pcC, fpC, spC)
         |  fpD   |
         |  pcD   |
         |  ...   |
fpD,spD +--------+   func_D (pcD, fpD, spD)
HIGH    |        |

    => calculate
              frame B: sp: *(spI + F_SP)
                       fp: *(spI + F_FP)
                       pc: *(spI + F_PC)
              frame C: sp: spB'
                       fp: fpB'
                       pc: "branch dest at pcD"

         In the case of loc-1 or loc-2,
         open issue-3: PC derived from parent's bl instruction (pcC) is
                  an entry address of the function, not the exact place
                  when an interrupt is taken.
         open issue-4: If parent's bl instruction is "blr", we have no way
                  to identify a branch destination address.
         open issue-5: Each tack pointer is determined from child to parent,
                  so PC for func_C cannot be fixed within wind_stackframe().

4) functions under function_graph tracer

    Unwind_frame() tries to replace "frame->pc" with an original value by
    looking it up in current->ret_stack[].

    Unfortunately, saving a value to this array, incrementing an array index,
    current->curr_ret_index, and updating memory on the stack are not done
    atomically because ftrace does not disable interrupts.
    So if an interrupt occurs and succeeding functions are also traced,
    recovering a stack dump list (stack_trace->entries) from ret_stack[]
    may result in a wrong list being created.

         open issue-6: wrong replacement
                  Assume that func_C calls func_B, func_B is interrupted, and
                  el1_irq calls func_A. Depending on when an interrupt is taken
                  during func_B, wind_stackframe() may return one of following
                  stack dump lists:
                  (please note that ftrace related functions are *not* traced.)

                 (A) interrupt during a function prologue
                         ...
                         return_to_handler(for A)
                         el1_irq
                         ...
                         ftrace_push_return_trace
                         prepare_ftrace_return
                         ftrace_caller
                         func_B
                         return_to_handler(for C)

                   we have pushed func_B in ret_stack[], but LR in a stack frame
                   has not yet been rewritten to return_to_handler.
                   since ret_stack[] has func_A, func_B and func_C, we replace
                         return_to_handler(for A) to func_A
                         return_to_handler(for C) to func_B
                   and so on.

                 (B) interrupt during a return hook
                         ...
                         return_to_handler(for A)
                         el1_irq
                         ...
                         ftrace_pop_return_trace
                         ftrace_return_to_handler
                         return_to_handler(for B)
                         return_to_handler(for C)

                   we may already have popped func_B from ret_stack[], and we
                   replace
                         return_to_handler(for A) to func_A
                         return_to_handler(for B) to func_C
                   and so on.

    In fact, there are more variant cases as pushing/popping against ret_stack[]
    is not an atomic operation and so data in this array and its index,
    current->ret_stack_index, may get inconsistent.

5) leaf function

    Adding arch-specific gcc option, -mno-omit-leaf-frame-pointer, to
    KBUILD_CFLAGS will assure that all the C functions have its own stack
    frame (after its function prologue).
    Then, issue 1-b will fall into issue-2.

         open issue-7: This fix will increase the code size of the kernel and
                  add extra instructions at runtime.

---
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
[2] https://lkml.org/lkml/2012/2/10/129