[PATCH] perf cs-etm: stamp pid/tid/EL on each buffered packet to fix cross-pid attribution
James Clark
james.clark at linaro.org
Tue May 26 04:18:27 PDT 2026
On 15/05/2026 3:11 am, Amir Ayupov wrote:
> In a system-wide `perf record -e cs_etm/.../u` capture on aarch64,
> synthesized samples emitted by `perf script --itrace=il64` are
> sometimes attributed to the WRONG sample.pid/tid (and to the wrong
> EL/cpumode) for the chunk of branches that straddle a context-switch
> boundary on a CPU. A branch actually retired by process A is emitted
> with sample.pid set to the thread that next ran on the same CPU.
>
> Mechanism:
> 1. ETM emits CONTEXTIDR/EL packets in-stream when the kernel updates
> CONTEXTIDR_EL1 on context switch / EL change. OpenCSD turns these
> into OCSD_GEN_TRC_ELEM_PE_CONTEXT elements interleaved with
> OCSD_GEN_TRC_ELEM_INSTR_RANGE elements for retired branch ranges.
> 2. cs_etm_decoder__buffer_range() queues each INSTR_RANGE into
> packet_queue->packet_buffer[]; packets carry start/end addrs,
> instr_count, last-instruction info, etc., but NO owner identity.
> 3. PE_CONTEXT goes through cs_etm_decoder__set_tid() ->
> cs_etm__set_thread(), which immediately mutates tidq->thread and
> tidq->el. Queued packets are not drained first; reset_timestamp()
> is called so the next TIMESTAMP triggers OCSD_RESP_WAIT and a
> drain.
> 4. By drain time in cs_etm__process_traceid_queue() ->
> cs_etm__sample(), sample.pid/tid is read from the now-mutated
> tidq->thread and sample.cpumode from the now-mutated tidq->el.
> Pre-context INSTR_RANGEs get the post-context owner.
>
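To make the mechanism above concrete, here is a minimal toy model of it. It is not the perf code: every name below (toy_packet, toy_queue, toy_buffer_range, toy_set_tid, toy_drain) is hypothetical, and a single integer tid stands in for tidq->thread. It only shows that when packets carry no owner and the context is mutated before the drain, every queued range takes the post-context owner.

```c
/* Toy model only; none of these names exist in tools/perf. */
#define TOY_MAX_PACKETS 8

struct toy_packet {
	unsigned long start;
	unsigned long end;	/* note: no owner identity stored */
};

struct toy_queue {
	struct toy_packet buf[TOY_MAX_PACKETS];
	int n;
	int cur_tid;		/* stands in for tidq->thread */
};

/* Queue an instruction range, as step 2 describes. */
static int toy_buffer_range(struct toy_queue *q, unsigned long start,
			    unsigned long end)
{
	if (q->n >= TOY_MAX_PACKETS)
		return -1;
	q->buf[q->n].start = start;
	q->buf[q->n].end = end;
	q->n++;
	return 0;
}

/* Context change: mutate the owner at once, without draining (step 3). */
static void toy_set_tid(struct toy_queue *q, int tid)
{
	q->cur_tid = tid;
}

/*
 * Drain: each sample reads the owner from the queue state at drain
 * time (step 4). Returns the number of samples written to out_tids.
 */
static int toy_drain(struct toy_queue *q, int *out_tids)
{
	int i, n = q->n;

	for (i = 0; i < n; i++)
		out_tids[i] = q->cur_tid;
	q->n = 0;
	return n;
}
```

Buffering one range while cur_tid is 100, calling toy_set_tid(&q, 200), buffering a second range and then draining yields tid 200 for both samples, although the first range was retired under tid 100.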
> The same race affects branch samples via tidq->prev_packet_thread /
> tidq->prev_packet_el, captured at packet-swap time from
> tidq->thread / tidq->el (which may already have flipped).
>
> This is independent of PERF_RECORD_SWITCH_CPU_WIDE, which is
> deliberately not used to assign sample identity in this path. The
> bug applies to any cs_etm capture with in-stream CONTEXTIDR
> (PIDFMT_CTXTID or PIDFMT_CTXTID2).
>
> Effect on downstream tools: branches that should belong to the
> previous thread on the CPU get attributed to the next thread. When
> the two threads share a binary, leaked branches' VAs land in the
> wrong thread's mappings; samples whose IPs land in r-x mappings
> silently pollute that binary's profile, while samples landing in
> R-only/RW mappings show up as out-of-range / non-text samples.
> Either way, AutoFDO/BOLT profiles built from `perf script --itrace`
> output of system-wide cs_etm captures contain misattributed samples.
>
> Concrete example from `perf script --itrace=il64` of the same
> captured branch (same timestamp, same IP, same from/to addrs) before
> and after this fix:
>
> before: launcher_multia 2638146/2638146 705897.219172: \
> fffcda6b124c 0xfffcda641958/0xfffcda6b123c
> after: ws-tcf-sr-io13 2736581/2741587 705897.219172: \
> fffcda6b124c 0xfffcda641958/0xfffcda6b123c
>
> The branch was retired by ws-tcf-sr-io13 (tid 2741587) but, before
> the fix, was attributed to launcher_multia (the next thread to run on
> that CPU after the context switch). After the fix, it is correctly
> attributed to ws-tcf-sr-io13.
>
> Why not "drain on PE_CONTEXT then switch" (deferred-set_thread):
> tidq->thread has two consumers: sample emission needs the OUTGOING
> identity for queued packets, but cs_etm__mem_access() needs the
> CURRENT thread's maps to fetch instruction bytes for OpenCSD. The
> two needs are temporally inverted; a single tidq->thread cannot
> serve both. Keeping tidq->thread current and stamping owner identity
> per packet is the only design that decouples them cleanly.
>
> Fix: capture the owning pid/tid/EL on each buffered packet at
> cs_etm_decoder__buffer_packet() time (before any subsequent
> PE_CONTEXT can mutate tidq->thread / tidq->el), and read them at
> sample emission time.
>
> - struct cs_etm_packet gains pid_t pid, pid_t tid, int el (storing
> an ocsd_ex_level value; typed as int so the struct does not
> depend on OpenCSD headers, which are only included inside
> HAVE_CSTRACE_SUPPORT).
> - cs_etm__etmq_get_pid_tid_el() (formerly cs_etm__etmq_get_pid_tid)
> returns all three.
> - cs_etm__synth_instruction_sample() reads sample.pid / sample.tid
> from tidq->packet->{pid,tid} and derives sample.cpumode from
> tidq->packet->el.
> - cs_etm__synth_branch_sample() reads sample.pid / sample.tid /
> cpumode from tidq->prev_packet->{pid,tid,el}.
> - The separate prev_packet_thread / prev_packet_el bookkeeping in
> cs_etm__packet_swap() / cs_etm__init_traceid_queue() /
> cs_etm__free_traceid_queues() is removed; the per-packet stamp
> on prev_packet now carries that information.
>
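The per-packet stamping described in the list above could be sketched roughly as follows. This is a standalone illustration under assumptions, not the patch itself: sketch_packet, sketch_queue, sketch_buffer_packet and sketch_set_context are hypothetical names, and only the pid/tid/el fields mirror what the description says struct cs_etm_packet gains.

```c
#include <sys/types.h>	/* pid_t */

/* Hypothetical stand-ins; not the real perf structures. */
#define SKETCH_MAX_PACKETS 8

struct sketch_packet {
	unsigned long start;
	unsigned long end;
	pid_t pid;	/* owner captured at buffer time */
	pid_t tid;
	int el;		/* exception level kept as a plain int */
};

struct sketch_queue {
	struct sketch_packet buf[SKETCH_MAX_PACKETS];
	int n;
	pid_t cur_pid;	/* current decoder context */
	pid_t cur_tid;
	int cur_el;
};

/* Stamp the current owner onto the packet when it is queued. */
static int sketch_buffer_packet(struct sketch_queue *q, unsigned long start,
				unsigned long end)
{
	struct sketch_packet *p;

	if (q->n >= SKETCH_MAX_PACKETS)
		return -1;
	p = &q->buf[q->n++];
	p->start = start;
	p->end = end;
	p->pid = q->cur_pid;
	p->tid = q->cur_tid;
	p->el = q->cur_el;
	return 0;
}

/* A later context change no longer affects already queued packets. */
static void sketch_set_context(struct sketch_queue *q, pid_t pid, pid_t tid,
			       int el)
{
	q->cur_pid = pid;
	q->cur_tid = tid;
	q->cur_el = el;
}
```

With this shape, sample emission reads pid/tid/el from the packet rather than from the queue, so the queue's context can stay current for memory access while queued packets keep the identity they were retired under.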
> Cost: 12 bytes added to struct cs_etm_packet (~12-16 KB per
> packet_queue with CS_ETM_PACKET_MAX_BUFFER=1024), 16 bytes saved per
> cs_etm_traceid_queue (one struct thread * + one ocsd_ex_level).
>
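The cost figures above can be checked with a small sizeof calculation. The struct and macro names here are hypothetical and only model the added and removed fields; the real growth of struct cs_etm_packet depends on its existing layout and padding, which is presumably why the range is given as 12-16 KB.

```c
#include <stddef.h>
#include <sys/types.h>	/* pid_t */

/* Hypothetical models of the fields only; not the real structs. */
struct stamp_fields {
	pid_t pid;
	pid_t tid;
	int el;
};

struct removed_fields {
	void *thread;	/* stands in for struct thread * */
	int el;		/* stands in for ocsd_ex_level */
};

/* Mirrors the CS_ETM_PACKET_MAX_BUFFER=1024 value quoted above. */
#define SKETCH_PACKET_MAX_BUFFER 1024

/* Bytes added per packet queue if the stamp costs exactly its fields. */
static size_t stamp_bytes_per_queue(void)
{
	return sizeof(struct stamp_fields) * SKETCH_PACKET_MAX_BUFFER;
}

/* Bytes saved per traceid queue, including alignment padding. */
static size_t removed_bytes_per_traceid_queue(void)
{
	return sizeof(struct removed_fields);
}
```

With a 4-byte pid_t and int, the stamp is 12 bytes, so 1024 packets cost 12288 bytes (12 KB); if padding rounds the per-packet growth up to 16 bytes, the total is 16 KB. On an LP64 target the removed pointer plus enum occupy 16 bytes once padded.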
> A residual gap: cs_etm__copy_insn() reads sample.insn bytes via
> cs_etm__mem_access(), which still uses tidq->thread (the current
> thread), so the inline insn bytes for an outgoing-thread sample may
> be looked up against the wrong address space. Fixing this requires
> threading the packet's owner pid through cs_etm__mem_access and is
> left for a follow-up. sample.ip / sample.pid attribution, which is
> what AutoFDO/BOLT consume, is correct.
>
Hi Amir,

Can you test the patch here to see if it fixes your issue [1]?

We thought it didn't make sense to store the thread on every packet when
there is only one active thread for the decoder and one for sample
generation. We also fixed the other issue mentioned above about
cs_etm__copy_insn() not working.

Thanks
James

[1]: https://lore.kernel.org/linux-perf-users/20260526-james-cs-context-tracking-fix-v1-0-ebd602e18287@linaro.org/T/#t