[RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

Thu Mar 7 11:27:40 PST 2024

Puranjay!

Puranjay Mohan <puranjay12 at gmail.com> writes:

> This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
> This allows each ftrace callsite to provide an ftrace_ops to the common
> ftrace trampoline, allowing each callsite to invoke distinct tracer
> functions without the need to fall back to list processing or to
> allocate custom trampolines for each callsite. This significantly speeds
> up cases where multiple distinct trace functions are used and callsites
> are mostly traced by a single tracer.
>
> The idea and most of the implementation is taken from the ARM64's
> implementation of the same feature. The idea is to place a pointer to
> the ftrace_ops as a literal at a fixed offset from the function entry
> point, which can be recovered by the common ftrace trampoline.

Not really a review, but some more background; Another rationale (on-top
of the improved per-call performance!) for CALL_OPS was to use it to
build ftrace direct call support (which BPF uses a lot!). Mark, please
correct me if I'm lying here!

On Arm64, CALL_OPS makes it possible to implement direct calls, while
only patching one BL instruction -- nice!

On RISC-V we cannot use use the same ideas as Arm64 straight off,
because the range of jal (compare to BL) is simply too short (+/-1M).
So, on RISC-V we need to use a full auipc/jal pair (the text patching
story is another chapter, but let's leave that aside for now). Since we
have to patch multiple instructions, the cmodx situation doesn't really
improve with CALL_OPS.

Let's say that we continue building on your patch and implement direct
calls on CALL_OPS for RISC-V as well.

>From Florent's commit message for direct calls:

  |    There are a few cases to distinguish:
  |    - If a direct call ops is the only one tracing a function:
  |      - If the direct called trampoline is within the reach of a BL
  |        instruction
  |         -> the ftrace patchsite jumps to the trampoline
  |      - Else
  |         -> the ftrace patchsite jumps to the ftrace_caller trampoline which
  |            reads the ops pointer in the patchsite and jumps to the direct
  |            call address stored in the ops
  |    - Else
  |      -> the ftrace patchsite jumps to the ftrace_caller trampoline and its
  |         ops literal points to ftrace_list_ops so it iterates over all
  |         registered ftrace ops, including the direct call ops and calls its
  |         call_direct_funcs handler which stores the direct called
  |         trampoline's address in the ftrace_regs and the ftrace_caller
  |         trampoline will return to that address instead of returning to the
  |         traced function

On RISC-V, where auipc/jalr is used, the direct called trampoline would
always be reachable, and then first Else-clause would never be entered.
This means the the performance for direct calls would be the same as the
one we have today (i.e. no regression!).

RISC-V does like x86 does (-ish) -- patch multiple instructions, long
reach.

Arm64 uses CALL_OPS and patch one instruction BL.

Now, with this background in mind, compared to what we have today,
CALL_OPS would give us (again assuming we're using it for direct calls):

* Better performance for tracer per-call (faster ops lookup) GOOD
* Larger text size (function alignment + extra nops) BAD
* Same direct call performance NEUTRAL
* Same complicated text patching required NEUTRAL

It would be interesting to see how the per-call performance would
improve on x86 with CALL_OPS! ;-)

I'm trying to wrap my head if it makes sense to have it on RISC-V, given
that we're a bit different from Arm64. Does the scale tip to the GOOD
side?

Oh, and we really need to see performance numbers on real HW! I have a
VF2 that I could try this series on.

Björn