[RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

Alexandre Ghiti alex at ghiti.fr
Wed Mar 6 12:38:01 PST 2024


+cc Andy and Evgenii

On 06/03/2024 21:35, Alexandre Ghiti wrote:
> Hi Puranjay,
>
> On 06/03/2024 17:59, Puranjay Mohan wrote:
>> This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
>> This allows each ftrace callsite to provide an ftrace_ops to the common
>> ftrace trampoline, allowing each callsite to invoke distinct tracer
>> functions without the need to fall back to list processing or to
>> allocate custom trampolines for each callsite. This significantly speeds
>> up cases where multiple distinct trace functions are used and callsites
>> are mostly traced by a single tracer.
>>
>> The idea and most of the implementation is taken from the ARM64's
>> implementation of the same feature. The idea is to place a pointer to
>> the ftrace_ops as a literal at a fixed offset from the function entry
>> point, which can be recovered by the common ftrace trampoline.
>>
>> We use -fpatchable-function-entry to reserve 8 bytes above the function
>> entry by emitting 2 4 byte or 4 2 byte  nops depending on the 
>> presence of
>> CONFIG_RISCV_ISA_C. These 8 bytes are patched at runtime with a pointer
>> to the associated ftrace_ops for that callsite. Functions are aligned to
>> 8 bytes to make sure that the accesses to this literal are atomic.
>>
>> This approach allows for directly invoking ftrace_ops::func even for
>> ftrace_ops which are dynamically-allocated (or part of a module),
>> without going via ftrace_ops_list_func.
>>
>> I've benchamrked this with the ftrace_ops sample module on Qemu, with
>> the next version, I will provide benchmarks on real hardware:
>>
>> Without this patch:
>>
>> +-----------------------+-----------------+----------------------------+
>> |  Number of tracers    | Total time (ns) | Per-call average time      |
>> |-----------------------+-----------------+----------------------------|
>> | Relevant | Irrelevant |    100000 calls | Total (ns) | Overhead (ns) |
>> |----------+------------+-----------------+------------+---------------|
>> |        0 |          0 |        15615700 |        156 |             - |
>> |        0 |          1 |        15917600 |        159 |             - |
>> |        0 |          2 |        15668000 |        156 |             - |
>> |        0 |         10 |        14971500 |        149 |             - |
>> |        0 |        100 |        15417600 |        154 |             - |
>> |        0 |        200 |        15387000 |        153 |             - |
>> |----------+------------+-----------------+------------+---------------|
>> |        1 |          0 |       119906800 |       1199 |          1043 |
>> |        1 |          1 |       137428600 |       1374 |          1218 |
>> |        1 |          2 |       159562400 |       1374 |          1218 |
>> |        1 |         10 |       302099900 |       3020 |          2864 |
>> |        1 |        100 |      2008785500 |      20087 | 19931 |
>> |        1 |        200 |      3965221900 |      39652 | 39496 |
>> |----------+------------+-----------------+------------+---------------|
>> |        1 |          0 |       119166700 |       1191 |          1035 |
>> |        2 |          0 |       157999900 |       1579 |          1423 |
>> |       10 |          0 |       425370100 |       4253 |          4097 |
>> |      100 |          0 |      3595252100 |      35952 | 35796 |
>> |      200 |          0 |      7023485700 |      70234 | 70078 |
>> +----------+------------+-----------------+------------+---------------+
>>
>> Note: per-call overhead is estimated relative to the baseline case with
>> 0 relevant tracers and 0 irrelevant tracers.
>>
>> With this patch:
>>
>> +-----------------------+-----------------+----------------------------+
>> |   Number of tracers   | Total time (ns) | Per-call average time      |
>> |-----------------------+-----------------+----------------------------|
>> | Relevant | Irrelevant |    100000 calls | Total (ns) | Overhead (ns) |
>> |----------+------------+-----------------+------------+---------------|
>> |        0 |          0 |        15254600 |        152 |             - |
>> |        0 |          1 |        16136700 |        161 |             - |
>> |        0 |          2 |        15329500 |        153 |             - |
>> |        0 |         10 |        15148800 |        151 |             - |
>> |        0 |        100 |        15746900 |        157 |             - |
>> |        0 |        200 |        15737400 |        157 |             - |
>> |----------+------------+-----------------+------------+---------------|
>> |        1 |          0 |        47909000 |        479 |           327 |
>> |        1 |          1 |        48297400 |        482 |           330 |
>> |        1 |          2 |        47314100 |        473 |           321 |
>> |        1 |         10 |        47844900 |        478 |           326 |
>> |        1 |        100 |        46591900 |        465 |           313 |
>> |        1 |        200 |        47178900 |        471 |           319 |
>> |----------+------------+-----------------+------------+---------------|
>> |        1 |          0 |        46715800 |        467 |           315 |
>> |        2 |          0 |       155134500 |       1551 |          1399 |
>> |       10 |          0 |       442672800 |       4426 |          4274 |
>> |      100 |          0 |      4092353900 |      40923 | 40771 |
>> |      200 |          0 |      7135796400 |      71357 | 71205 |
>> +----------+------------+-----------------+------------+---------------+
>>
>> Note: per-call overhead is estimated relative to the baseline case with
>> 0 relevant tracers and 0 irrelevant tracers.
>>
>> As can be seen from the above:
>>
>>   a) Whenever there is a single relevant tracer function associated 
>> with a
>>      tracee, the overhead of invoking the tracer is constant, and 
>> does not
>>      scale with the number of tracers which are *not* associated with 
>> that
>>      tracee.
>>
>>   b) The overhead for a single relevant tracer has dropped to ~1/3 of 
>> the
>>      overhead prior to this series (from 1035ns to 315ns). This is 
>> largely
>>      due to permitting calls to dynamically-allocated ftrace_ops without
>>      going through ftrace_ops_list_func.
>>
>> Why is this patch a RFC patch:
>>   1. I saw some rcu stalls on Qemu and need to debug them and see if 
>> they
>>      were introduced by this patch.
>
>
> FYI, I'm currently working on debugging such issues (and other) with 
> the *current* ftrace implementation, so probably not caused by your 
> patchset. But keep debugging too, maybe this introduces other issues 
> or even better, you'll find the root cause :)
>
>
>>   2. This needs to be tested thoroughly on real hardware.
>>   3. Seeking reviews to fix any fundamental problems with this patch 
>> that I
>>      might have missed due to my lack of RISC-V architecture knowledge.
>>   4. I would like to benchmark this on real hardware and put the 
>> results in
>>      the commit message.
>>
>> Signed-off-by: Puranjay Mohan <puranjay12 at gmail.com>
>> ---
>>   arch/riscv/Kconfig              |  2 ++
>>   arch/riscv/Makefile             |  8 +++++
>>   arch/riscv/include/asm/ftrace.h |  3 ++
>>   arch/riscv/kernel/asm-offsets.c |  3 ++
>>   arch/riscv/kernel/ftrace.c      | 59 +++++++++++++++++++++++++++++++++
>>   arch/riscv/kernel/mcount-dyn.S  | 42 ++++++++++++++++++++---
>>   6 files changed, 112 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 0bfcfec67ed5..e474742e23b2 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -78,6 +78,7 @@ config RISCV
>>       select EDAC_SUPPORT
>>       select FRAME_POINTER if PERF_EVENTS || (FUNCTION_TRACER && 
>> !DYNAMIC_FTRACE)
>>       select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY if 
>> DYNAMIC_FTRACE
>> +    select FUNCTION_ALIGNMENT_8B if DYNAMIC_FTRACE_WITH_CALL_OPS
>
>
> A recent discussion [1] states that -falign-functions cannot guarantee 
> this alignment for all code and that gcc developers came up with a new 
> option [2]: WDYT? I have added Andy and Evgenii in +cc to help on that.
>
> [1] 
> https://lore.kernel.org/linux-riscv/4fe4567b-96be-4102-952b-7d39109b2186@yadro.com/
> [2] 
> https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
>
>
>>       select GENERIC_ARCH_TOPOLOGY
>>       select GENERIC_ATOMIC64 if !64BIT
>>       select GENERIC_CLOCKEVENTS_BROADCAST if SMP
>> @@ -127,6 +128,7 @@ config RISCV
>>       select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && MMU && 
>> (CLANG_SUPPORTS_DYNAMIC_FTRACE || GCC_SUPPORTS_DYNAMIC_FTRACE)
>>       select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>>       select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>> +    select HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS if 
>> (DYNAMIC_FTRACE_WITH_REGS && !CFI_CLANG)
>>       select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
>>       select HAVE_FUNCTION_GRAPH_TRACER
>>       select HAVE_FUNCTION_GRAPH_RETVAL if HAVE_FUNCTION_GRAPH_TRACER
>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>> index 252d63942f34..875ad5dc3d32 100644
>> --- a/arch/riscv/Makefile
>> +++ b/arch/riscv/Makefile
>> @@ -14,12 +14,20 @@ endif
>>   ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>       LDFLAGS_vmlinux += --no-relax
>>       KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>> +ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS), y)
>> +ifeq ($(CONFIG_RISCV_ISA_C),y)
>> +    CC_FLAGS_FTRACE := -fpatchable-function-entry=8,4
>> +else
>> +    CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2
>> +endif
>> +else
>>   ifeq ($(CONFIG_RISCV_ISA_C),y)
>>       CC_FLAGS_FTRACE := -fpatchable-function-entry=4
>>   else
>>       CC_FLAGS_FTRACE := -fpatchable-function-entry=2
>>   endif
>>   endif
>> +endif
>>     ifeq ($(CONFIG_CMODEL_MEDLOW),y)
>>   KBUILD_CFLAGS_MODULE += -mcmodel=medany
>> diff --git a/arch/riscv/include/asm/ftrace.h 
>> b/arch/riscv/include/asm/ftrace.h
>> index 329172122952..c9a84222c9ea 100644
>> --- a/arch/riscv/include/asm/ftrace.h
>> +++ b/arch/riscv/include/asm/ftrace.h
>> @@ -28,6 +28,9 @@
>>   void MCOUNT_NAME(void);
>>   static inline unsigned long ftrace_call_adjust(unsigned long addr)
>>   {
>> +    if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
>> +        return addr + 8;
>> +
>>       return addr;
>>   }
>>   diff --git a/arch/riscv/kernel/asm-offsets.c 
>> b/arch/riscv/kernel/asm-offsets.c
>> index a03129f40c46..7d7c4b486852 100644
>> --- a/arch/riscv/kernel/asm-offsets.c
>> +++ b/arch/riscv/kernel/asm-offsets.c
>> @@ -488,4 +488,7 @@ void asm_offsets(void)
>>       DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct 
>> stackframe), STACK_ALIGN));
>>       OFFSET(STACKFRAME_FP, stackframe, fp);
>>       OFFSET(STACKFRAME_RA, stackframe, ra);
>> +#ifdef CONFIG_FUNCTION_TRACER
>> +    DEFINE(FTRACE_OPS_FUNC,        offsetof(struct ftrace_ops, func));
>> +#endif
>>   }
>> diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
>> index f5aa24d9e1c1..e2e75e15d32e 100644
>> --- a/arch/riscv/kernel/ftrace.c
>> +++ b/arch/riscv/kernel/ftrace.c
>> @@ -82,9 +82,52 @@ static int __ftrace_modify_call(unsigned long 
>> hook_pos, unsigned long target,
>>       return 0;
>>   }
>>   +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
>> +static const struct ftrace_ops *riscv64_rec_get_ops(struct 
>> dyn_ftrace *rec)
>> +{
>> +    const struct ftrace_ops *ops = NULL;
>> +
>> +    if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
>> +        ops = ftrace_find_unique_ops(rec);
>> +        WARN_ON_ONCE(!ops);
>> +    }
>> +
>> +    if (!ops)
>> +        ops = &ftrace_list_ops;
>> +
>> +    return ops;
>> +}
>> +
>> +static int ftrace_rec_set_ops(const struct dyn_ftrace *rec,
>> +                  const struct ftrace_ops *ops)
>> +{
>> +    unsigned long literal = rec->ip - 8;
>> +
>> +    return patch_text_nosync((void *)literal, &ops, sizeof(ops));
>> +}
>> +
>> +static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec)
>> +{
>> +    return ftrace_rec_set_ops(rec, &ftrace_nop_ops);
>> +}
>> +
>> +static int ftrace_rec_update_ops(struct dyn_ftrace *rec)
>> +{
>> +    return ftrace_rec_set_ops(rec, riscv64_rec_get_ops(rec));
>> +}
>> +#else
>> +static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec) { return 0; }
>> +static int ftrace_rec_update_ops(struct dyn_ftrace *rec) { return 0; }
>> +#endif
>> +
>>   int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
>>   {
>>       unsigned int call[2];
>> +    int ret;
>> +
>> +    ret = ftrace_rec_update_ops(rec);
>> +    if (ret)
>> +        return ret;
>>         make_call_t0(rec->ip, addr, call);
>>   @@ -98,6 +141,11 @@ int ftrace_make_nop(struct module *mod, struct 
>> dyn_ftrace *rec,
>>               unsigned long addr)
>>   {
>>       unsigned int nops[2] = {NOP4, NOP4};
>> +    int ret;
>> +
>> +    ret = ftrace_rec_set_nop_ops(rec);
>> +    if (ret)
>> +        return ret;
>>         if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
>>           return -EPERM;
>> @@ -125,6 +173,13 @@ int ftrace_init_nop(struct module *mod, struct 
>> dyn_ftrace *rec)
>>     int ftrace_update_ftrace_func(ftrace_func_t func)
>>   {
>> +    /*
>> +     * When using CALL_OPS, the function to call is associated with the
>> +     * call site, and we don't have a global function pointer to 
>> update.
>> +     */
>> +    if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
>> +        return 0;
>> +
>>       int ret = __ftrace_modify_call((unsigned long)&ftrace_call,
>>                          (unsigned long)func, true, true);
>>       if (!ret) {
>> @@ -147,6 +202,10 @@ int ftrace_modify_call(struct dyn_ftrace *rec, 
>> unsigned long old_addr,
>>       make_call_t0(caller, old_addr, call);
>>       ret = ftrace_check_current_call(caller, call);
>>   +    if (ret)
>> +        return ret;
>> +
>> +    ret = ftrace_rec_update_ops(rec);
>>       if (ret)
>>           return ret;
>>   diff --git a/arch/riscv/kernel/mcount-dyn.S 
>> b/arch/riscv/kernel/mcount-dyn.S
>> index b7561288e8da..cb241e36e514 100644
>> --- a/arch/riscv/kernel/mcount-dyn.S
>> +++ b/arch/riscv/kernel/mcount-dyn.S
>> @@ -191,11 +191,35 @@
>>       .endm
>>         .macro PREPARE_ARGS
>> -    addi    a0, t0, -FENTRY_RA_OFFSET
>> +    addi    a0, t0, -FENTRY_RA_OFFSET    // ip (callsite's auipc insn)
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
>> +    /*
>> +     * When CALL_OPS is enabled (2 or 4) nops [8B] are placed before 
>> the
>> +     * function entry, these are later overwritten with the pointer 
>> to the
>> +     * associated struct ftrace_ops.
>> +     *
>> +     * -8: &ftrace_ops of the associated tracer function.
>> +     *<ftrace enable>:
>> +     *  0: auipc  t0/ra, 0x?
>> +     *  4: jalr   t0/ra, ?(t0/ra)
>> +     *
>> +     * -8: &ftrace_nop_ops
>> +     *<ftrace disable>:
>> +     *  0: nop
>> +     *  4: nop
>> +     *
>> +     * t0 is set to ip+8 after the jalr is executed at the callsite,
>> +     * so we find the associated op at t0-16.
>> +     */
>> +    mv     a1, ra                // parent_ip
>> +    REG_L   a2, -16(t0)            // op
>> +    REG_L   ra, FTRACE_OPS_FUNC(a2)        // op->func
>> +#else
>>       la    a1, function_trace_op
>> -    REG_L    a2, 0(a1)
>> -    mv    a1, ra
>> -    mv    a3, sp
>> +    REG_L    a2, 0(a1)            // op
>> +    mv    a1, ra                // parent_ip
>> +#endif
>> +    mv    a3, sp                // regs
>>       .endm
>>     #endif /* CONFIG_DYNAMIC_FTRACE_WITH_REGS */
>> @@ -233,8 +257,12 @@ SYM_FUNC_START(ftrace_regs_caller)
>>       SAVE_ABI_REGS 1
>>       PREPARE_ARGS
>>   +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
>> +    jalr ra
>> +#else
>>   SYM_INNER_LABEL(ftrace_regs_call, SYM_L_GLOBAL)
>>       call    ftrace_stub
>> +#endif
>>         RESTORE_ABI_REGS 1
>>       bnez    t1, .Ldirect
>> @@ -247,9 +275,13 @@ SYM_FUNC_START(ftrace_caller)
>>       SAVE_ABI_REGS 0
>>       PREPARE_ARGS
>>   -SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
>> +    jalr ra
>> +#else
>> +SYM_INNER_LABEL(ftrace_regs_call, SYM_L_GLOBAL)
>>       call    ftrace_stub
>>   +#endif
>>       RESTORE_ABI_REGS 0
>>       jr    t0
>>   SYM_FUNC_END(ftrace_caller)
>
>
> As I'm diving into ftrace right now, I'll give a proper review soon. 
> But as a note, I noticed that the function_graph tracer, when enabled, 
> makes the whole system unresponsive (but still up, just very slow). A 
> fix I sent recently seems to really improve that if you're interested 
> in testing it (I am :)). You can find it here: 
> https://lore.kernel.org/linux-riscv/20240229121056.203419-1-alexghiti@rivosinc.com/
>
> Thanks,
>
> Alex
>
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv



More information about the linux-riscv mailing list