[PATCH v9 0/9] perf cs-etm: Support thread stack and callchain
James Clark
james.clark at linaro.org
Wed Jun 17 03:06:38 PDT 2026
On 16/06/2026 3:51 pm, Leo Yan wrote:
> This series adds thread-stack and synthesized callchain support for Arm
> CoreSight, which comes from older series [1] but heavily rewritten.
>
> CS ETM previously kept last-branch state in a per-trace-queue buffer.
> That effectively makes the state per CPU, while the call/return history
> belongs to a thread. This series moves branch tracking to the common
> thread-stack code.
>
> The series records CoreSight branches with thread_stack__event(), uses
> thread_stack__br_sample() for last branch entries, flushes thread stacks
> after decoder resets.
>
> A decoder reset between AUX trace buffers is treated as a global trace
> discontinuity, so all thread stacks are flushed, so avoids carrying
> stale call/return history across a trace discontinuity.
>
> One limitation remains for instructions emulated by the kernel. In that
> case the exception return address may not match the return address
> stored in the thread stack, because after exception return can be one
> instruction ahead. The stack can still recover when a later return
> matches an upper caller. Given emulated instructions are not the common
> target for performance callchain analysis. Supporting this would require
> extending the common thread-stack path to accept both the real target
> address and an adjusted address for stack matching, so this series
> leaves that extra complexity out.
>
> The series has been tested on Orion6 board:
>
> perf test 136 -vvv
> 136: CoreSight synthesized callchain:
> --- start ---
> test child forked, pid 3539
> ---- end(0) ----
> 136: CoreSight synthesized callchain : Ok
>
> perf script --itrace=g16i10il64
>
> callchain_test 17468 [005] 1031003.229943: 10 instructions:
> aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
> ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
> ffff90bd233c call_init+0x9c (inlined)
> ffff90bd233c __libc_start_main_impl+0x9c (inlined)
> aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
>
> callchain_test 17468 [005] 1031003.229943: 10 instructions:
> aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
> aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
> aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
> aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
> ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
> ffff90bd233c call_init+0x9c (inlined)
> ffff90bd233c __libc_start_main_impl+0x9c (inlined)
> aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
>
> callchain_test 17468 [005] 1031003.229944: 10 instructions:
> ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
> aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
> aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
> aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
> aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
> ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
> ffff90bd233c call_init+0x9c (inlined)
> ffff90bd233c __libc_start_main_impl+0x9c (inlined)
> aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
>
> Note, the test fails on Juno board which is caused by many discontinuity
> packets (mainly caused by NO_SYNC elem). This is likely caused by the
> FIFO overflow on the path.
It passes 20/20 times on my r0 Juno board. But I wonder if reducing the
timestamp interval fixes it by reducing the rate of trace generation?
I'm just about to send patches that will change it to "timestamp=14" for
--per-thread mode (only context packet timestamps). And "timestamp=7"
for per-cpu mode (64 cycle interval).
>
> [1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linaro.org/
>
> Signed-off-by: Leo Yan <leo.yan at arm.com>
> ---
> Changes in v9:
> - Added patch 01 to fixed thread leak during trace queue init (sashiko).
> - Added check in instruction and branch samples in
> cs_etm__add_stack_event() (sashiko).
> - Released frontend_thread properly in cs_etm__context() (sashiko).
> - Refined cs_etm__flush_all_stack() to use switch (sashiko).
> - Gathered James' review tags.
> - Rebased on the latest perf-tools-next.
> - Link to v8: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v8-0-737948584fea@arm.com
>
> Changes in v8:
> - Updated test_arm_coresight_disasm.sh to pass "--itrace=b" and updated
> examples in arm-cs-trace-disasm.py (James).
> - Removed static annotation in callchain workload and renamed functions
> with prefix "callchain_" to reduce naming conflict (James).
> - For callchain test pre-condition check, removed the aarch64 check and
> added the root permission check (James).
> - Resolved the shellcheck errors (James).
> - Link to v7: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v7-0-1ba770c862ae@arm.com
>
> Changes in v7:
> - Rebased on the latest perf-tools-next.
> - Used struct_size() for allocation callchain struct (James).
> - Added a helper cs_etm__packet_has_taken_branch() (James).
> - Minor improvements for the callchain test (used record-ctl FIFO and
> reworked the validation callstack push / pop).
> - Link to v6: https://lore.kernel.org/r/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49f53c9dd@arm.com
>
> Changes in v6:
> - Heavily rewrote the patches since restarted the work after 6 years.
> - Changed to use the common thread-stack for branch stack and callchain
> management.
> - Added a callchain test.
> - Link to v5: https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linaro.org/
>
> Changes in v5:
> - Addressed Mike's suggestion for performance improvement for function
> cs_etm__instr_addr() for quick calculation for non T32;
> - Removed the patch 'perf cs-etm: Synchronize instruction sample with
> the thread stack' (Mike);
> - Fixed the issue for exception is taken for branch target address
> accessing, for the branch sample and stack thread handling, the
> related patches are 01, 02, 07;
> - Fixed the stack thread handling for instruction emulation and single
> step with patches 08, 09.
> - Link to v4: https://lore.kernel.org/linux-arm-kernel/20200203020716.31832-1-leo.yan@linaro.org/
>
> ---
> Leo Yan (9):
> perf cs-etm: Fix thread leaks on trace queue init failure
> perf cs-etm: Filter synthesized branch samples
> perf cs-etm: Decode ETE exception packets
> perf cs-etm: Refactor instruction size handling
> perf cs-etm: Use thread-stack for last branch entries
> perf cs-etm: Flush thread stacks after decoder reset
> perf cs-etm: Support call indentation
> perf cs-etm: Synthesize callchains for instruction samples
> perf test: Add Arm CoreSight callchain test
>
> tools/perf/Documentation/perf-test.txt | 6 +-
> tools/perf/scripts/python/arm-cs-trace-disasm.py | 9 +-
> tools/perf/tests/builtin-test.c | 1 +
> tools/perf/tests/shell/coresight/callchain.sh | 172 ++++++++++
> .../shell/coresight/test_arm_coresight_disasm.sh | 4 +-
> tools/perf/tests/tests.h | 1 +
> tools/perf/tests/workloads/Build | 2 +
> tools/perf/tests/workloads/callchain.c | 33 ++
> tools/perf/util/cs-etm.c | 367 +++++++++++++--------
> 9 files changed, 446 insertions(+), 149 deletions(-)
> ---
> base-commit: 44543bb53ebfecb5ce4e890053a666affab9e482
> change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
>
> Best regards,
More information about the linux-arm-kernel
mailing list