[PATCH v20 11/11] perf: arm_pmuv3: Add support for the Branch Record Buffer Extension (BRBE)

Thu Apr 3 09:34:46 PDT 2025

Is this something we can try to test? I have the ability to set this up 
and test it on a server system in my lab.

If we can't write up a potential test case, it is not testable, and thus 
probably not something to worry about.

On 2/25/25 14:46, Mark Rutland wrote:
> On Tue, Feb 25, 2025 at 12:38:13PM +0000, Leo Yan wrote:
>> On Mon, Feb 24, 2025 at 07:31:52PM -0600, Rob Herring wrote:
>>
>> [...]
>>
>>>>>> When event rotation happens without a context switch, in theory we
>>>>>> should be able to use the branch records directly (no invalidation,
>>>>>> no injection) for all events.
>>>>> No; that only works in *some* cases, and will produce incorrect results
>>>>> in others.
>>>>>
>>>>> For example, consider filtering. Imagine a PMU with a single counter,
>>>>> and two events, where event-A filters for calls-and-returns and event-B
>>>>> filters for calls-only. When switching from event-A to event-B, it's
>>>>> theoretically possible to keep the existing records around, knowing that
>>>>> the returns can be filtered out later. When switching from event-B to
>>>>> event-A we cannot keep the existing records, since there are gaps
>>>>> whenever a return should have been recorded.
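
The asymmetry in this example comes down to a subset check. Below is a
minimal sketch, assuming made-up BR_CALL/BR_RETURN bits (hypothetical
names for illustration, not the BRBE register fields or the perf ABI)
and a hypothetical helper records_reusable():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical branch-type bits, for illustration only. */
#define BR_CALL   (1u << 0)
#define BR_RETURN (1u << 1)

/*
 * Records captured under the HW filter 'captured' are gap-free for an
 * event asking for 'wanted' only if every wanted type was being recorded.
 */
static bool records_reusable(uint32_t captured, uint32_t wanted)
{
	return (wanted & ~captured) == 0;
}

static void filter_example(void)
{
	/* event-A (calls and returns) -> event-B (calls only): reusable */
	assert(records_reusable(BR_CALL | BR_RETURN, BR_CALL));
	/* event-B -> event-A: returns were never recorded, so there are gaps */
	assert(!records_reusable(BR_CALL, BR_CALL | BR_RETURN));
}
```

This only models the branch-type part of the filter. Privilege level and
the other configurable filters would each need the same kind of check,
which is the complexity being argued against here.
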
>>>> Seems to me, the problem is not caused by event rotation.  We need to
>>>> calculate a correct filter in the first place - the BRBE driver should
>>>> calculate a superset of the filters of all events in a session.  Then,
>>>> generate branch records based on each event's specific filter.
>>> The driver doesn't have enough information. If it is told to schedule
>>> event A, it doesn't know anything about event B. It could in theory
>>> try to remember event B if event B had already been scheduled, but it
>>> never knows when event B is gone.
>> E.g., I tried below command for enabling 10 events in a perf session:
>>
>>    perf record -e armv9_nevis/r04/ -e armv9_nevis/r05/ \
>>                -e armv9_nevis/r06/ -e armv9_nevis/r07/ \
>>                -e armv9_nevis/r08/ -e armv9_nevis/r09/ \
>>                -e armv9_nevis/r10/ -e armv9_nevis/r11/ \
>>                -e armv9_nevis/r12/ -e armv9_nevis/r13/ \
>>                -- sleep 1
>>
>> For the Arm PMU, the flow below is invoked for every event on every
>> affine CPU in the initialization phase:
>>
>>    armpmu_event_init() {
>>      armv8pmu_set_event_filter();
>>    }
>>
>> Shouldn't we calculate a superset branch filter for all events, store
>> it into a per-CPU data structure and then apply the filter on BRBE?
> Should we? No.
>
> *NONE* of the events in your example are CPU-bound, and the call to
> armpmu_event_init() can happen on an arbitrary CPU which the relevant
> event never actually runs on, while other unrelated events may run on
> that CPU.
>
> It makes no sense for armv8pmu_set_event_filter() to write to a per-cpu
> structure. That's purely there to determine what the filters *should* be
> when *that specific event* is programmed into hardware.
>
> As Rob and I have pointed out already, the *only* thing that can be
> relevant to deciding the configuration of HW filtering is the set of
> events which are *active* on that CPU.
>
>>>>> There are a number of cases of that shape given the set of configurable
>>>>> filters. In theory it's possible to retain those in some cases, but I
>>>>> don't think that the complexity is justified.
>>>>>
>>>>> Similarly, whenever kernel branches are recorded it's necessary to drop
>>>>> the stale branches whenever branch recording is paused, as there's
>>>>> necessarily a blackout period and hence a gap in the records.
>>>> If we save the BRBE records when a process is switched out and then
>>>> restore them when the process is switched in, could we keep a decent
>>>> branch record for performance profiling?
>>> Keep in mind that there's only 64 branches recorded at most. How many
>>> branches in a context switch plus reconfiguring the PMU? Not a small
>>> percentage of 64 I think. In traces where freeze on overflow was not
>>> working (there's an example in v18), just the interrupt entry until
>>> BRBE was stopped was a significant part of the trace. A context switch
>>> is going to be similar.
>> That is true for tracing with kernel mode enabled.  But we will have no
>> such noise for userspace-only tracing.
> As mentioned elsewhere, it's not a problem for x86, so why is it
> magically a problem for arm64?
>
>>>>> Do you have a reason why you think we *must* keep events around?
>>>> What I am really concerned about are cases when a process is preempted or
>>>> migrated.  The driver doesn't save and restore branch records for these
>>>> cases, it just invalidates all records when a task is scheduled in.
>>>>
>>>> As a result, if an event overflow is close to a context switch, it is
>>>> likely to capture incomplete branch records.  For userspace-only
>>>> tracing, there is a risk of capturing an empty branch record after
>>>> preemption and migration.
>>> There's the same risk if something else is recording kernel branches
>>> when you are recording userspace only. I think the user has to be
>>> aware if other things like context switches are perturbing their data.
>> I am confused by the description above.  Does it refer to branch
>> recording across different sessions?  It is fine for me that the branch
>> data is interleaved by different sessions (e.g. one is global tracing
>> and another is only per-thread tracing).
> Imagine that there's an existing process with some pid ${PID}, and
> concurrently, the following commands are run, either by the same user or
> different users with appropriate permissions:
>
> 	# Trying to record user branches only
> 	perf record -j any,u -e cycles -p ${PID}
>
> 	# Trying to record kernel branches only
> 	perf record -j any,k -e cycles -p ${PID}
>
> Whatever you do, the task trying to record user branches only will lose
> some records:
>
> * If we make the events mutually exclusive, the branches will only be
>    recorded when the user event is installed.
>
> * If we merge the HW filters and later apply a SW filter, it's very
>    likely that kernel branches taken after exception entry have filled
>    all the records, and there are no user branches left to sample.
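
The second bullet can be illustrated with a toy model. This is a sketch
only, assuming a buffer that keeps the most recent 64 branches (the
count quoted earlier in this thread) and drops the oldest on overflow.
The struct and function names are hypothetical and do not come from the
driver:

```c
#include <assert.h>
#include <stddef.h>

#define NR_RECORDS 64	/* record count quoted in this thread */

/* Hypothetical record: only the privilege of the branch matters here. */
struct rec { int kernel; };

struct buf {
	struct rec r[NR_RECORDS];
	size_t head;	/* next slot to overwrite */
	size_t count;	/* valid records, at most NR_RECORDS */
};

/* Record one branch, overwriting the oldest entry once the buffer is full. */
static void record(struct buf *b, int kernel)
{
	assert(b->head < NR_RECORDS);
	b->r[b->head].kernel = kernel;
	b->head = (b->head + 1) % NR_RECORDS;
	if (b->count < NR_RECORDS)
		b->count++;
}

/* SW filter: count the records a user-only event could still report. */
static size_t user_records(const struct buf *b)
{
	size_t i, n = 0;

	for (i = 0; i < b->count; i++)
		n += !b->r[i].kernel;
	return n;
}

/* Run 'user' user branches followed by 'kernel' kernel branches. */
static size_t user_left(size_t user, size_t kernel)
{
	struct buf b = { .head = 0, .count = 0 };
	size_t i;

	for (i = 0; i < user; i++)
		record(&b, 0);
	for (i = 0; i < kernel; i++)
		record(&b, 1);
	return user_records(&b);
}
```

With a merged user+kernel HW filter, user_left(64, 64) is 0: once 64 or
more kernel branches follow exception entry, the SW filter for the
user-only event has nothing left to report.
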
>
>> We might need to consider an intact branch record for the single perf
>> session case.  E.g. if userspace program calls:
>>
>>      func_a -> func_b -> func_c
>>
>> In the case of userspace-only tracing, we will have no chance to preserve
>> the call sequence of these functions after the program is switched out.
> If those functions are small, it's very likely that they'll all be in
> the branch history. If they're so large that they're not executed in one
> scheduling quantum, do you expect them to fall within the same event
> period?
>
> I think that you're making a big deal out of an edge case that doesn't
> matter much in practice.
>
> Mark.
>