[RFC PATCH -next v2 3/4] arm64/ftrace: support dynamically allocated trampolines
Mark Rutland
mark.rutland at arm.com
Wed May 25 05:27:23 PDT 2022
On Thu, May 05, 2022 at 10:57:35AM +0800, Wangshaobo (bobo) wrote:
>
> On 2022/4/22 0:27, Mark Rutland wrote:
> > On Thu, Apr 21, 2022 at 11:42:01AM -0400, Steven Rostedt wrote:
> > > On Thu, 21 Apr 2022 16:14:13 +0100
> > > Mark Rutland <mark.rutland at arm.com> wrote:
> > >
> > > > > Let's say you have 10 ftrace_ops registered (with bpf and kprobes this can
> > > > > be quite common). But each of these ftrace_ops traces a function (or
> > > > > functions) that are not being traced by the other ftrace_ops. That is, each
> > > > > ftrace_ops has its own unique function(s) that they are tracing. One could
> > > > > be tracing schedule, the other could be tracing ksoftirqd_should_run
> > > > > (whatever).
> > > > Ok, so that's when messing around with bpf or kprobes, and not generally
> > > > when using plain old ftrace functionality under /sys/kernel/tracing/
> > > > (unless that's concurrent with one of the former, as per your other
> > > > reply) ?
> > > It's any user of the ftrace infrastructure, which includes kprobes, bpf,
> > > perf, function tracing, function graph tracing, and also affects instances.
> > >
> > > > > Without this change, because the arch does not support dynamically
> > > > > allocated trampolines, it means that all these ftrace_ops will be
> > > > > registered to the same trampoline. That means, for every function that is
> > > > > traced, it will loop through all 10 of these ftrace_ops and check their
> > > > > hashes to see if their callback should be called or not.
> > > > Sure; I can see how that can be quite expensive.
> > > >
> > > > What I'm trying to figure out is who this matters to and when, since the
> > > > implementation is going to come with a bunch of subtle/fractal
> > > > complexities, and likely a substantial overhead too when enabling or
> > > > disabling tracing of a patch-site. I'd like to understand the trade-offs
> > > > better.
> > > >
> > > > > With dynamically allocated trampolines, each ftrace_ops will have their own
> > > > > trampoline, and that trampoline will be called directly if the function
> > > > > is only being traced by the one ftrace_ops. This is much more efficient.
> > > > >
> > > > > If a function is traced by more than one ftrace_ops, then it falls back to
> > > > > the loop.
> > > > I see -- so the dynamic trampoline is just to get the ops? Or is that
> > > > doing additional things?
> > > It's to get both the ftrace_ops (as that's one of the parameters) as well
> > > as to call the callback directly. Not sure if arm is affected by spectre,
> > > but the "loop" function is filled with indirect function calls, whereas
> > > the dynamic trampolines call the callback directly.
> > >
> > > Instead of:
> > >
> > >         bl ftrace_caller
> > >
> > > ftrace_caller:
> > >         [..]
> > >         bl ftrace_ops_list_func
> > >         [..]
> > >
> > >
> > > void ftrace_ops_list_func(...)
> > > {
> > >         __do_for_each_ftrace_ops(op, ftrace_ops_list) {
> > >                 if (ftrace_ops_test(op, ip)) // test the hash to see if it
> > >                                              // should trace this
> > >                                              // function.
> > >                         op->func(...);
> > >         }
> > > }
> > >
> > > It does:
> > >
> > >         bl dynamic_tramp
> > >
> > > dynamic_tramp:
> > >         [..]
> > >         bl func         // call the op->func directly!
> > >
> > >
> > > Much more efficient!
> > >
> > >
> > > > There might be a middle-ground here where we patch the ftrace_ops
> > > > pointer into a literal pool at the patch-site, which would allow us to
> > > > handle this atomically, and would avoid the issues with out-of-range
> > > > trampolines.
> > > Have an example of what you are suggesting?
> > We can make the compiler place 2 NOPs before the function entry point, and 2
> > NOPs after it using `-fpatchable-function-entry=4,2` (the arguments are
> > <total>,<before>). On arm64 all instructions are 4 bytes, and we'll use the
> > first two NOPs as an 8-byte literal pool.
> >
> > Ignoring BTI for now, the compiler generates (with some magic labels added here
> > for demonstration):
> >
> > __before_func:
> >         NOP
> >         NOP
> > func:
> >         NOP
> >         NOP
> > __remainder_of_func:
> >         ...
> >
> > At ftrace_init_nop() time we patch that to:
> >
> > __before_func:
> >         // treat the 2 NOPs as an 8-byte literal-pool
> >         .quad   <default ops pointer>   // see below
> > func:
> >         MOV     X9, X30
> >         NOP
> > __remainder_of_func:
> >         ...
> >
> > When enabling tracing we do
> >
> > __before_func:
> >         // patch this with the relevant ops pointer
> >         .quad   <ops pointer>
> > func:
> >         MOV     X9, X30
> >         BL      <trampoline>            // common trampoline
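> >
> > To illustrate (a simplified sketch, ignoring BTI and alignment details;
> > the register choices and the OPS_FUNC_OFFSET name are assumptions, not
> > part of any posted patch), the common trampoline could then recover the
> > ops pointer relative to the return address the BL leaves in the LR,
> > since the literal sits at a fixed offset before the patch-site:
> >
> > <common trampoline>:
> >         // The BL is the second instruction in func, so on entry
> >         // X30 == __remainder_of_func, and the 8-byte literal placed
> >         // before func lives at X30 - 16.
> >         LDR     X10, [X30, #-16]        // X10 = ops pointer for this call-site
> >         // ... save registers and set up the ftrace arguments ...
> >         LDR     X11, [X10, #OPS_FUNC_OFFSET]    // hypothetical offsetof(struct ftrace_ops, func)
> >         BLR     X11                     // call ops->func directly, no list walk
> >         // ... restore registers and return ...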
>
> I have a question: is this common trampoline allocated by module_alloc()?
> If yes, how is the long jump from the traced func to the common trampoline
> handled if we only add two NOPs in front of func?
No; as today there'd be *one* trampoline in the main kernel image, and where a
module is out-of-range it will use a PLT the module loader created at load time
(and any patch-site in that module would use the same PLT and trampoline,
regardless of what the ops pointer was).
There might be a PLT between the call and the trampoline, but that wouldn't
have any functional effect; we'd still get all the arguments, the original LR
(in x9), and the location of the call (in the LR), as we get today.
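For illustration, the PLT entry the module loader emits is roughly of this
shape (a sketch; the exact encoding is generated at module load time):

<module ftrace PLT entry>:
        ADRP    X16, <trampoline>               // page containing the trampoline
        ADD     X16, X16, :lo12:<trampoline>    // full trampoline address
        BR      X16                             // branch; X30 is left untouched

It only clobbers X16 (IP0), which the AAPCS reserves as an intra-procedure-call
scratch register, so the argument registers, X9, and the LR all arrive at the
trampoline unchanged.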
For how we do that today, see commits:
* e71a4e1bebaf7fd9 ("arm64: ftrace: add support for far branches to dynamic ftrace")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e71a4e1bebaf7fd990efbdc04b38e5526914f0f1
* f1a54ae9af0da4d7 ("arm64: module/ftrace: intialize PLT at load time")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f1a54ae9af0da4d76239256ed640a93ab3aadac0
* 3b23e4991fb66f6d ("arm64: implement ftrace with regs")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b23e4991fb66f6d152f9055ede271a726ef9f21
Thanks,
Mark.