[RFC] stack tracer on arm64
Jungseok Lee
jungseoklee85 at gmail.com
Tue Sep 15 06:15:11 PDT 2015
On Sep 11, 2015, at 9:38 PM, AKASHI Takahiro wrote:
Hi Akashi,
> Junseok's bug report[1] about stack tracer on arm64 has exposed additional
> issues to me. I tried to fix them for the last few weeks, and submitted
> several RFC/patches. But the problems seem to be more complicated than
> I first imagined.
>
> So, before taking further steps, I'd like you to have time to read
> the following document and give me your thoughts on it.
>
> Some of my ideas, particularly a function prologue analyzer, are difficult
> to implement, and may have no guarantee to handle all the cases correctly.
Totally agreed.
> Some may have impact on code size (and extra instructions at runtime).
> I think that we should find out the balance of accuracy/conveniences of
> this tool (stack tracer) and added complexities/open issues.
Yeah, I've utilized the tool when figuring out max stack depth and its
context. Accuracy is not a critical issue. At least my case..
> -Takahiro AKASHI
>
> Table of Contents
> =================
> I. basic flow of stack tracer
>
> II. issues on the naive implementation
>
> III.my approach and open issues
>
>
> I. basic flow of stack tracer
> ------------------------------
> Every C function invokes a ftrace hook routine, _mcount or ftrace_caller,
> in its function prologue.
>
> step-1.If the current stack pointer indicates that we have a new record
> of maximum stack usage, then do the followings.
> step-2.call arch-specific save_stack_trace() to collect a list of functions
> in current call chains (we name it a stack dump list here.)
> step-3.For each function in a stack dump list, compare it against
> every 4bytes of integer on the stack data.
> step-4.If matched, we recognize the matched location in the stack as
> a current stack pointer for this function
> (for example, idxB for func_B:pcB)
> step-5.calculate a "stack index" for this function as
> <top address of stack> - <stack pointer>
> and record the value into dump_stack_index[].
>
> LOW | |
> rspA | |
> | ... <----------- static/dynamic local variables in func_A
> | |
> rbpA +--------+ func_A (pcA, rbpA, rspA)
> | rbpB |
> idxB +--------+
> | pcB |
> | ... <----------- extra function args to func_A
> rspB + - - - -+
> | |
> | ... <----------- static/dynamic local variables in func_B
> | |
> rbpB +--------+ func_B (pcB, rbpB, rspB)
> | rbpC |
> idxC +--------+
> | pcC |
> | ... <----------- extra function args to func_B
> rspC + - - - -+
> | |
> | ... <----------- static/dynamic local variables in func_C
> | |
> rbpC +--------+ func_C (pcC, rbpC, rspC)
> | |
> | ... |
> top + - - - -+
> HIGH
> * In the figure above, rbp(frame pointer) and rsp(stack
> pointer) point to respective addresses at the time after
> a function prologue has been done but before a callee function
> is called.
>
> Later on, "cat /sys/kernel/debug/tracing/stack_trace" shows the latest list
> of stack dump with sizes for each function.
> - stack size for func_B, for example, is calculated as
> <stack size> = (top - idxB) - (top - idxC)
>
>
> II. issues on the naive implementation
> --------------------------------------
> By "naive implementation," I mean how the mainline kernel without any
> modification behaves with ftrace-based stack tracer.
> (Currently, there is no arch-specific code on arm64 regarding stack tracer.)
>
> 1) slurping stack
>
> Assume that a given function A was invoked, and it was invoked again in
> another context, then it called another function B which allocated a large
> length of local variable on the stack, but it has not modified the variable
> yet.
> When stack tracer, in step-3, examines the stack looking for B, then A,
> we may have a chance to accidentally find a staled, not current, stack
> frame for A because the old frame might reside on the memory for
> the variable which has not been overwritten.
>
> a. The stack_trace output may have staled entries.
>
> 2) differences between x86 and arm64
>
> On x86, "call" instruction automatically pushes a return address on
> the top of the stack and decrements a stack pointer. Then child function
> allocates its local variables on the stack.
>
> On arm64, a child function decreases reserve memory for local variables,
> pushes a return address (LR) and old frame pointer then updates a frame
> pointer to stack pointer.
As I mentioned in other threads, this is a starting point to complicate
this stack tracer work. Is this why you need a function prologue analyzer?
IMHO, it is the best way to change this behavior, but PCS and toolchain
update might be not feasible?
>
> So step-3/4 seems OK on x86, but on arm64, an estimated stack pointer,
> idxB in the picture below, is not appropriate for our purpose because it
> contains child function's "local variables."
> We should instead use spB, if possible, for better result.
This is why a function prologue analyzer is needed. Isn't it?
>
> a. Stack size for a function in stack_trace output is inaccurate,
> or rather wrong.
> (It seems as if <Size> field is one-line offset against <Location>.)
>
> LOW | ... |
> fpA +--------+ func_A (pcA, fpA, spA)
> | fpB |
> idxB + - - - -+
> | pcB |
> | ... <----------- static local variables in func_A
> | ... | and extra function args to func_A
> spB + - - - -+
> | ... <----------- dynamically allocated variables in func_B
> fpB +--------+ func_B (pcB, fpB, spB)
> | fpC |
> idxC + - - - -+
> | pcC |
> | ... <----------- static local variables in func_B
> | ... | and extra function args to func_B
> spC + - - - -+
> | ... |
> fpC +--------+ func_C (pcC, fpC, spC)
> HIGH | |
>
> 3) Interrupted frame
> We miss one or two functions when an interrupt occurs.
> Assume that func_D calls func_C, func_C calls func_B, func_B is
> interrupted by a device, el1_irq calls func_A, and so on.
At least, arm64 could be aligned with x86 if IRQ stack is introduced on arm64.
> a.Since el1_irq (and other exception vectors) doesn't update a frame
> pointer (x29), the current wind_stackframe() tracks only
> func_A -> el1_irq -> func_C -> func_D
> and always misses func_B in a stack dump list.
>
> LOW | |
> fpA,spA +--------+ func_A (pcA, fpA, spA)
> | fpB |
> | pcI |
> | ... |
> spI +--------+ el1_irq (pcI, fpB, spI)
> | |
> | ... |
> | |
> | fpB |
> | spB |
> | pcB |
> | ... |
> fpB,spB +--------+ func_B (pcB, fpB, spB)
> | fpC |
> | pcC |
> | ... |
> fpC,spC +--------+ func_C (pcC, fpC, spC)
> | fpD |
> | pcD |
> | ... |
> fpD,spD +--------+ func_D (pcD, fpD, spD)
> HIGH | |
>
> b.Even worse, if an interrupt occurs in a function prolog of func_B,
> a stack frame can be uncompleted and a frame pointer may still
> points to one of func_C (fpC).
> In this case, we also miss func_C in a stack dump list.
This scenario seems inevitable regardless of architecture. It is impossible
to retrieve stack trace information without updating frame pointer when
CPU locks up in a function prologue. Thus, it is not needed to consider this
situation.
> c.As a result, stack sizes for functions around el1_irq are wrong.
>
> 4) functions under function_graph tracer
>
> To catch an event of function return, function_graph tracer rewrites
> LR register saved in a stack frame to a hook function, return_to_handler.
> An original value is saved in current->ret_stack[] on entry and restored
> later on exit by ftrace_return_to_handler() called in return_to_handler.
>
> a. With function_graph tracer on, unwind_stackframe() will return lots of
> bogus entries, "return_to_handler-0x4."
>
> Apart from stack tracer, fixing this issue might make more sense, say,
> for panic cases because panic_stack_dump() also uses unwind_frame().
>
> 5) leaf function
>
> gcc (on arm64) will, by default, optimize a leaf function by removing
> a stack frame. This causes several issues:
>
> a.we always miss a case of maximum stack size whenever a leaf function
> is at the top of call chains because a stack tracer code will never
> be executed against it.
> b.When an interrupt occurs during a leaf function (func_B), we miss its
> parent function (func_C) in a stack dump list because func_B has no
> stack frame and elr_el1 still points to func_C's frame.
> (This is a variant case as described in "interrupted frame" section.)
>
> LOW | |
> fpA,spA +--------+ func_A (pcA, fpA, spA)
> | fpB |
> | pcI |
> | ... |
> spI +--------+ el1_irq (pcI, fpC, spI)
> (pt_regs)|
> | ... |
> | fpC |
> | spB |
> | pcB |
> spB +--------+ func_B (pcB, fpC, spB)
> | ... |
> fpC,spC +--------+ func_C (pcC, fpC, spC)
> HIGH | |
>
> 6) KVM
> (T.B.D.)
Amazing!
>
>
> III.my approach and open issues
> -------------------------------
>
> 1) slurping stack, and
> 2) differences between x86 and arm64
>
> Instead of slurping stack data like at step-3/4, we extend save_stack_trace()
> /wind_stackframe() to collect a list of stack pointers for all call chains.
> Since arm64 ABI does not save a stack pointer in a stack frame, we try to
> calculate that value from callee's stack frame.
>
> For example, in the following case, a value of stack point when func_B
> calls func_A is calculated as:
>
> spB = fpA + (XX -- YY)
>
> where XX and YY are determined by a "function prologue analyzer" described
> in "interrupted frame" section.
>
> LOW | ... |
> fpA +--------+ func_A (pcA, fpA, spA)
> | fpB |
> | pcB |
> | ... <----------- static local variables in func_A
> | ... | and extra function args to func_A
> spB + - - - -+
> | ... <----------- dynamically allocated variables in func_B
> fpB +--------+ func_B (pcB, fpB, spB)
> | fpC |
> | pcC |
> | ... <----------- static local variables in func_B
> | ... | and extra function args to func_B
> spC + - - - -+
> | ... |
> fpC +--------+ func_C (pcC, fpC, spC)
> HIGH | |
>
> open issue-0: This fix also depends on function prologue analyzer
> being available.
>
> 3) interrupted frame
>
> First of all, we should update a frame pointer even in exception vector
> code so that unwind_frame() can dereference stack frames.
> Since exception handlers place data of 'struct pt_regs' instead of
> 'struct stackframe' in *normal* functions, we will take special
> care of this case.
>
> open issue-1: should we use the same form of stack frame here?
>
> This way, we will see an additional entry, func_B in the figure above,
> in a stack dump list but we still have a chance to miss func_C.
>
> My idea here is to analyze a function prologue and determine where
> an interrupt is actually taken, then to re-build a stack frame if necessary.
>
> open issue-2: it seems to be very difficult to implement such a parser
> for all the possible patterns of function prologue. This is
> partly because gcc doesn't use a fixed template to generate
> a function prologue code, and partly because "instruction
> scheduling" may mess up a function prologue and a body.
>
> According to comments from some toolchain guy, using libunwind with
> compiler-generated "unwind tables" information would be most promising
> for implementing an function prologue analyzer.
> It seems, however, that any previous attempts to merge libunwind into
> the kernel has been abandoned[2].
>
> So, in the followings, I try to describe an outline of my own prototype
> (simple) analyzer and how it is utilized in order to find out func_C.
>
> My analyzer recognizes two patterns of function prologue, (A) and (B),
> which I think cover most cases, and identifies a location of interrupt,
> 1, 2, 3 and 0 (a normal case).
My kallsyms based analyzer was a complete failure since it cannot address
the "open issue-7". I'm really interested in your analyzer ;)
> function prologues
> ------------------
> (A)
> sub sp, sp, #XX
> stp x29, x30, [sp, #YY]
> add x29, sp, #YY
> ...
> (B)
> stp x29, x30, [sp, #-XX]
> mov x29, sp
> ...
>
> Due to gcc's instruction scheduling, we may see other instructions
> between those assembler lines of function prologue.
>
> location of interrupt
> ---------------------
> (A) (B)
> loc-1
> sub sp, sp, #XX stp, x29, x30, [sp, #-XX]
> loc-2
> stp x29, x30, [sp, #YY]
> loc-3
> add x29, sp, #YY mov x29, sp
> loc-0
> <function body> <function body>
>
> Here,
> loc-1: SP not updated, FP/LR not saved, FP not updated
> loc-2: SP updated, FP/LR not saved, FP not updated
> loc-3: SP updated, FP/LR saved, FP not updated
> loc-0: SP updated, FP/LR saved, FP updated
>
> Re-build a stack frame for func_B and func_C
> --------------------------------------------
> The following figures illustrate how we can re-build a stack frame for
> func_B and func_C based on available data on the stack.
>
> (loc-0)
>
> LOW | |
> fpA,spA +--------+ func_A (pcA, fpA, spA)
> | fpI |
> | pcI |
> | ... |
> fpI,spI +--------+ el1_irq (pcI, fpI, spI)
> | |
> | ... |
> | |
> | fpB |
> | spB |
> | pcB |
> | ... |
> fpB,spB +--------+ func_B (pcB, fpB, spB)
> | fpC |
> | pcC |
> | ... |
> fpC,spC +--------+ func_C (pcC, fpC, spC)
> | fpD |
> | pcD |
> | ... |
> fpD,spD +--------+ func_D (pcD, fpD, spD)
> HIGH | |
>
> => calculate
> frame B: sp: *(spI + S_SP)
> fp: *(spI + S_FP)
> pc: *(spI + S_PC)
>
> (loc-3)
>
> LOW | |
> fpA,spA +--------+ func_A (pcA, fpA, spA)
> | fpI |
> | pcI |
> | ... |
> fpI,spI +--------+ el1_irq (pcI, fpI, spI)
> | |
> | ... |
> | |
> | fpC |
> | spB |
> | pcB |
> | ... |
> spB +--------+ func_B (pcB, fpC, spB)
> | fpC |
> | pcC |
> | ... |
> fpC,spC +--------+ func_C (pcC, fpC, spC)
> | fpD |
> | pcD |
> | ... |
> fpD,spD +--------+ func_D (pcD, fpD, spD)
> HIGH | |
>
> => calculate
> frame B: sp: *(spI + F_SP)
> fp: *(spI + F_FP)
> pc: *(spI + F_PC)
> frame C: sp: spB + XX
> fp: fpB'
> pc: *(spB + YY + 0x8)
>
> (loc-2)
>
> LOW | |
> fpA,spA +--------+ func_A (pcA, fpA, spA)
> | fpI |
> | pcI |
> | ... |
> fpI,spI +--------+ el1_irq (pcI, fpI, spI)
> | |
> | ... |
> | |
> | fpC |
> | spB |
> | pcB |
> | ... |
> spB +--------+ func_B (pcB, fpC, spB)
> | x |
> | x |
> | ... |
> fpC,spC +--------+ func_C (pcC, fpC, spC)
> | fpD |
> | pcD |
> | ... |
> fpD,spD +--------+ func_D (pcD, fpD, spD)
> HIGH | |
>
> => calculate
> frame B: sp: *(spI + F_SP)
> fp: *(spI + F_FP)
> pc: *(spI + F_PC)
> frame C: sp: spB + XX
> fp: fpB'
> pc: "branch dest at pcD"
>
> (loc-1)
>
> LOW | |
> fpA,spA +--------+ func_A (pcA, fpA, spA)
> | fpI |
> | pcI |
> | ... |
> fpI,spI +--------+ el1_irq (pcI, fpI, spI)
> | |
> | ... |
> | |
> | fpC |
> | spC |
> | pcB |
> | ... | func_B (pcB, fpC, spC)
> fpC,spC +--------+ func_C (pcC, fpC, spC)
> | fpD |
> | pcD |
> | ... |
> fpD,spD +--------+ func_D (pcD, fpD, spD)
> HIGH | |
>
> => calculate
> frame B: sp: *(spI + F_SP)
> fp: *(spI + F_FP)
> pc: *(spI + F_PC)
> frame C: sp: spB'
> fp: fpB'
> pc: "branch dest at pcD"
>
> In the case of loc-1 or loc-2,
> open issue-3: PC derived from parent's bl instruction (pcC) is
> an entry address of the function, not the exact place
> when an interrupt is taken.
> open issue-4: If parent's bl instruction is "blr", we have no way
> to identify a branch destination address.
> open issue-5: Each tack pointer is determined from child to parent,
> so PC for func_C cannot be fixed within wind_stackframe().
>
> 4) functions under function_graph tracer
>
> Unwind_frame() tries to replace "frame->pc" with an original value by
> looking it up in current->ret_stack[].
>
> Unfortunately, saving a value to this array, incrementing an array index,
> current->curr_ret_index, and updating memory on the stack are not done
> atomically because ftrace does not disable interrupts.
> So if an interrupt occurs and succeeding functions are also traced,
> recovering a stack dump list (stack_trace->entries) from ret_stack[]
> may result in a wrong list being created.
>
> open issue-6: wrong replacement
> Assume that func_C calls func_B, func_B is interrupted, and
> el1_irq calls func_A. Depending on when an interrupt is taken
> during func_B, wind_stackframe() may return one of following
> stack dump lists:
> (please note that ftrace related functions are *not* traced.)
>
> (A) interrupt during a function prologue
> ...
> return_to_handler(for A)
> el1_irq
> ...
> ftrace_push_return_trace
> prepare_ftrace_return
> ftrace_caller
> func_B
> return_to_handler(for C)
>
> we have pushed func_B in ret_stack[], but LR in a stack frame
> has not yet been rewritten to return_to_handler.
> since ret_stack[] has func_A, func_B and func_C, we replace
> return_to_handler(for A) to func_A
> return_to_handler(for C) to func_B
> and so on.
>
> (B) interrupt during a return hook
> ...
> return_to_handler(for A)
> el1_irq
> ...
> ftrace_pop_return_trace
> ftrace_return_to_handler
> return_to_handler(for B)
> return_to_handler(for C)
>
> we may already have popped func_B from ret_stack[], and we
> replace
> return_to_handler(for A) to func_A
> return_to_handler(for B) to func_C
> and so on.
>
> In fact, there are more variant cases as pushing/popping against ret_stack[]
> is not an atomic operation and so data in this array and its index,
> current->ret_stack_index, may get inconsistent.
>
> 5) leaf function
>
> Adding arch-specific gcc option, -mno-omit-leaf-frame-pointer, to
> KBUILD_CFLAGS will assure that all the C functions have its own stack
> frame (after its function prologue).
> Then, issue 1-b will fall into issue-2.
>
> open issue-7: This fix will increase the code size of the kernel and
> add extra instructions at runtime.
>
>
> ---
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html
> [2] https://lkml.org/lkml/2012/2/10/129
More information about the linux-arm-kernel
mailing list