[RFC] arm64: ftrace with regs for livepatch support

Mon Dec 14 00:22:13 PST 2015

on 2015/11/20 19:47, AKASHI Takahiro wrote:
> In this RFC, I'd like to describe and discuss some issues on adding ftrace/
> livepatch support on arm64 before actually submitting patches. In fact,
> porting livepatch is not a complicated task, but adding "ftrace with
> regs(CONFIG_DYNAMIC_FTRACE_WITH_REGS)" which livepatch heavily relies on
> is a matter.
> (There is another discussion about "arch-independent livepatch" in LKML.)
>
> Under "ftrace with regs", an ftrace helper function (ftrace_regs_caller)
> will be called with cpu registers (struct pt_regs) at the beginning of
> a function if tracing is enabled on the function. Livepatch utilizes this
> argument to replace PC and jump back into a new (patched) function.
> (Please note that this feature will also be used for ftrace-based kprobes.)
>
> On arm64, there is no template for a function prologue, and "instruction
> scheduling" may mix it with a function body. So a helper function, which
> is inserted by gcc's "-pg" option, cannot (at least potentially) recognize
> correct values of registers because some may have already been overwritten
> at that point.
>
> Instead, x86 uses gcc's "-mfentry" option, which inserts "call __fentry__"
> as the first instruction of a function, to implement "ftrace with regs".
> As this option is arch-specific, after discussions with toolchain folks,
> we are proposing a new arch-neutral option, "-fprolog-pad=N"[1].
> This option inserts N nop instructions before a function prologue so that
> any architecture can utilize it to replace nops with whatever instruction
> sequence they want later on when required.
> (I assume that nop is very cheap in terms of performance impact.)
>
> First, let me explain how we can implement "ftrace with regs", or more
> specifically, ftrace_make_call() and ftrace_make_nop(), as well as what the
> inserted instruction sequences look like. Implementing ftrace_regs_caller
> is quite straightforward, so we don't have to care about it (at least, in
> this RFC).
>
> 1) instruction sequence
> Unlike x86, we have to preserve the link register (x30) explicitly on arm64
> since an ftrace helper function will be invoked before a function prologue.
> So we need a few instructions here, not one. Two possible ways:
>
>  (a) stp x29, x30, [sp, #-16]!
>      mov x29, sp
>      bl <mcount>
>      ldp x29, x30, [sp], #16
>      <function prologue>
>      ...
>
>  (b) mov x9, x30
>      bl <mcount>
>      mov x30, x9
>      <function prologue>
>      ...
>
> (a) complies with a normal calling convention.
> (b) is Li Bin's idea in his old patch. While (b) can save some memory
> accesses by using a scratch register(x9 in this example), we have no way
> to recover an actual value for this register.
>
>       Q#1. Which approach should we take here?
>

I think the more appropriate way to implement livepatch on arm64 is to
modify the instruction directly, with the help of the gcc "-fprolog-pad=N"
option (N only needs to be 1), rather than building it on ftrace.

func:
    nop    <--->  b <(new_func1 - func)>  <--->   b <(new_func2 - func)>       
    [prologue]

NOP and B both belong to the set of instructions that the architecture covers
under "concurrent modification and execution of instructions": they can be
executed by one thread of execution while they are being modified by another
thread of execution, without requiring explicit synchronization.
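
To make the nop <---> b swap above concrete, here is a minimal user-space
sketch of how the single instruction word could be generated and checked. The
helper names (a64_encode_b, a64_is_nop_or_b, a64_check_swap) are made up for
illustration and are not kernel APIs; in the kernel one would presumably use
the existing arm64 instruction helpers instead of open-coding this. Only the
NOP and B encodings and the +/-128 MiB range of B are architectural facts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define A64_NOP       0xd503201fu          /* NOP */
#define A64_B_OPCODE  0x14000000u          /* B <imm26>, bits [31:26] = 000101 */
#define A64_B_MASK    0xfc000000u
#define A64_B_RANGE   (INT64_C(1) << 27)   /* B reaches +/-128 MiB */

/*
 * Encode "b <to>" for an instruction located at address <from>.
 * Returns 0 on success, -1 if the target is misaligned or out of range.
 * (Hypothetical helper, for illustration only.)
 */
static int a64_encode_b(uint64_t from, uint64_t to, uint32_t *insn_out)
{
    int64_t off = (int64_t)(to - from);

    assert(insn_out != NULL);
    if ((off & 3) != 0)                          /* must be word aligned */
        return -1;
    if (off < -A64_B_RANGE || off >= A64_B_RANGE)
        return -1;

    /* imm26 holds the signed word offset (byte offset / 4) */
    *insn_out = A64_B_OPCODE |
                (uint32_t)(((uint64_t)off >> 2) & 0x03ffffffu);
    return 0;
}

/* True if insn is NOP or an unconditional immediate branch (B). */
static bool a64_is_nop_or_b(uint32_t insn)
{
    return insn == A64_NOP || (insn & A64_B_MASK) == A64_B_OPCODE;
}

/*
 * Returns 0 if replacing old_insn with new_insn stays within the
 * NOP/B set used by this proposal, -1 otherwise.
 */
static int a64_check_swap(uint32_t old_insn, uint32_t new_insn)
{
    return (a64_is_nop_or_b(old_insn) && a64_is_nop_or_b(new_insn)) ? 0 : -1;
}
```

Note that the range check means new_func has to live within 128 MiB of func
for a single B to be enough; anything further away would need a different
sequence, which is outside what this sketch covers.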

On arm64, this method branches straight to the new function instead of going
through the ftrace trampoline on every call, so it should perform significantly
better than the ftrace-based method, especially for critical functions that are
called frequently.

Can we modify livepatch to allow an arch-specific implementation? For example,
by making klp_enable_func/klp_disable_func weak functions, so that an
architecture can provide its own implementation that does not use ftrace. I
already have a prototype patchset, have tested it, and will post it soon.
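
As a rough illustration of the weak-function idea, the sketch below shows
generic defaults that an architecture could override with strong definitions
of the same names. Everything here is a stand-in: struct klp_func_sketch and
the *_sketch functions are hypothetical, the bodies only flip a flag where the
real generic code would register or unregister with ftrace, and the kernel
would spell the attribute as __weak. It is not taken from the prototype
patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-function patch descriptor. */
struct klp_func_sketch {
    const char *old_name;   /* name of the function being patched */
    void *new_func;         /* replacement function */
    int patched;            /* 1 while the patch is active */
};

/*
 * Generic default. Being weak, the linker drops it if an architecture
 * supplies a strong klp_enable_func_sketch() in another object file,
 * e.g. one that rewrites the leading NOP into a B.
 */
__attribute__((weak)) int klp_enable_func_sketch(struct klp_func_sketch *func)
{
    if (func == NULL || func->new_func == NULL)
        return -1;
    /* generic path: this is where the ftrace-based redirection would go */
    func->patched = 1;
    return 0;
}

/* Generic default for the disable side, overridable in the same way. */
__attribute__((weak)) void klp_disable_func_sketch(struct klp_func_sketch *func)
{
    assert(func != NULL);
    /* generic path: this is where the ftrace handler would be removed */
    func->patched = 0;
}
```

The core code keeps calling the same two entry points either way; only the
link step decides whether the generic or the arch version is used.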

Thanks,
Li Bin

>
> 2) replacing an instruction sequence
>    (This issue is orthogonal to Q#1.)
>
> Replacing can happen anytime, so we have to do it (without any locking) in
> such a safe way that any task either calls a helper or doesn't call it, but
> never runs in any intermediate state.
>
> Again here, two possible ways:
>
>   (a) initialize the code in the shape of (A') at boot time,
>             (B) -> (B') -> (A')
>       then switch between (A) and (A')
>   (b) take a few steps each time. For example,
>       to enable tracing,
>             (B) -> (B') -> (A') -> (A)
>       to disable tracing,
>             (A) -> (A') -> (B') -> (B)
>       Obviously, we need cache flushing/invalidation and barriers between.
>
>     (A)                                (A')
>         stp x29, x30, [sp, #-16]!           b 1f
>         mov x29, sp                         mov x29, sp
>         bl <_mcount>                        bl <_mcount>
>         ldp x29, x30, [sp], #16             ldp x29, x30, [sp], #16
>                                          1:
>         <function prologue>
>         <function body>
>         ...
>
>     (B)                                (B')
>         nop                                 b 1f
>         nop                                 nop
>         nop                                 nop
>         nop                                 nop
>                                          1:
>         <function prologue>
>         <function body>
>         ...
>
> (a) is much simpler, but (b) has a smaller performance penalty(?) when
> tracing is disabled. I'm afraid that I may be oversimplifying the issue.
>
>        Q#2. Which one is more preferable?
>
>
> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>     https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>
>
> Thanks,
> -Takahiro AKASHI
>