[PATCH V17 0/9] arm64/perf: Enable branch stack sampling

Wed Jun 5 23:27:20 PDT 2024


On 6/6/24 10:28, Anshuman Khandual wrote:
> On 5/30/24 23:11, Mark Rutland wrote:
>> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
>>> On 05/04/2024 03:46, Anshuman Khandual wrote:
>>>> This series enables perf branch stack sampling support on arm64 platform
>>>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>>>> the relevant register definitions could be accessed here.
>>>>
>>>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>>>
>>>> This series applies on 6.9-rc2.
>>>>
>>>> Also this series is being hosted below for quick access, review and test.
>>>>
>>>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>>>>
>>>> There are still some open questions regarding handling multiple perf events
>>>> with different privilege branch filters getting on the same PMU, supporting
>>>> guest branch stack tracing from the host etc. Finally also looking for some
>>>> suggestions regarding supporting BRBE inside the guest. The series has been
>>>> re-organized completely as suggested earlier.
>>>>
>>>> - Anshuman
>>>>
>>> [...]
>>>>
>>>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>>>
>>>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>>>> remain the same for all the events during a perf record session.
>>>>
>>>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>>>
>>>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>>>
>>>> This 'branch_sample_type' is used to configure the BRBE hardware, when both
>>>> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
>>>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
>>>> remain the same for all events. Let's consider the following example
>>>>
>>>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>>>
>>>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>>>
>>>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
>>>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>>>> BRBE hardware with 'branch_sample_type' from last event to be added in the
>>>> PMU and hence captured branch records only get passed on to matching events
>>>> during a PMU interrupt.
>>>>
>>>
>>> Hi Anshuman,
>>>
>>> Surely because of this example we should merge? At least we have to try
>>> to make the most common basic command lines work. Unless we expect all
>>> tools to know whether the branch buffer is shared between PMUs on each
>>> architecture or not. The driver knows though, so can merge the settings
>>> because it all has to go into one BRBE.
>>
>> The difficulty here is that these are opened as independent events (not
>> in the same event group), and so from the driver's PoV, this is no
>> different two two users independently doing:
>>
>> 	perf record -e event:u -j any,save_type -p ${SOME_PID}
>>
>> 	perf record -e event:k -j any,save_type -p ${SOME_PID}
>>
>> ... where either would be surprised to get the merged result.
> 
> Right, that's the problem. The proposed idea here ensures that each event
> here will get only expected branch records, even though sample size might
> get reduced as the HW filters overrides might not be evenly split between
> them during the perf session.
> 
>>
>> So, if we merge the filters into the most permissive set, we *must*
>> filter them when handing them to userspace so that each event gets the
>> expected set of branch records.
> 
> Agreed, if the branch filters get merged to the most permissive set via
> an OR semantics, then results must be filtered before being pushed into
> the ring buffer for each individual event that has overflown during the
> PMU IRQ.
> 
>>
>> Assuming we do that, for Anshuman's case above:
>>
>> 	perf record -e cycles:u -e instructions:k -j any,save_type ls	
>>
>> ... the overflow of the cycles evnt will only record user branch
>> records, and the overflow of the instructions event will only record
>> kernel branch records.
> 
> Right.
> 
>>
>> I think it's arguable either way as to whether users want that or merged
>> records; we should probably figure out with the tools folk what the
>> desired behaviour is for that command line, but as above I don't think
>> that we can have the kernel give both events merged records unless
>> userspace asks for that explicitly.
> 
> Right, we should not give merged results unless explicitly asked by the
> event. Otherwise that might break the semantics.
> 
>>
>>> Merging the settings in tools would be a much harder problem.
>>
>> I can see that it'd be harder to do that up-front when parsing each
>> event independently, but there is a phase where the tool knows about all
>> of the events and *might* be able to do that, where the driver doesn't
>> really know at any point that these events are related.
>>
>> Regardless, I assume that if the user passes:
>>
>> 	perf record -e cycles:u -e instructions:k -k any,u,k,save_type ls
>>
>> ... both events will be opened with PERF_SAMPLE_BRANCH_USER and
>> PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
>> is sufficient.
> Kernel filtering will not be required in this case as "-j any,u,k," overrides
> both event's individual privilege request i.e cycles:u and instructions:k. So
> both the events will receive branch records related to PERF_SAMPLE_BRANCH_USER
> and PERF_SAMPLE_BRANCH_KERNEL. From branch sample perspective privilege filter
> is (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL).
> 
>>
>>> Any user that doesn't have permission for anything in the merged result
>>> can continue to get nothing.
>>>
>>> And we can always add filtering in the kernel later on if we want to to
>>> make it appear to behave even more normally.
>>
>> As above, I think if we merge the HW filters in the kernel then the
>> kernel must do SW filtering. I don't think that's something we can leave
>> for later.
> 
> Alright.
> 
>>
>>>> static int
>>>> armpmu_add(struct perf_event *event, int flags)
>>>> {
>>>> 	........
>>>> 	if (has_branch_stack(event)) {
>>>> 		/*
>>>> 		 * Reset branch records buffer if a new task event gets
>>>> 		 * scheduled on a PMU which might have existing records.
>>>> 		 * Otherwise older branch records present in the buffer
>>>> 		 * might leak into the new task event.
>>>> 		 */
>>>> 		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>>>> 			hw_events->brbe_context = event->ctx;
>>>> 			if (armpmu->branch_reset)
>>>> 				armpmu->branch_reset();
>>>> 		}
>>>> 		hw_events->brbe_users++;
>>>> Here ------->	hw_events->brbe_sample_type = event->attr.branch_sample_type;
>>>> 	}
>>>> 	........
>>>> }
>>>>
>>>> Instead of overriding existing 'branch_sample_type', both could be merged.
>>>>
>>>
>>> I can't see any use case where anyone would want the override behavior.
>>> Especially if you imagine multiple users not even aware of each other.
>>
>> I completely agree that one event overriding another is not an
>> acceptable solution.
>>
>>> Either the current "no records for mismatches" or the merged one make sense.
>>
>> I think our options are:
>>
>> 1) Do not allow events with conflicting HW filters to be scheduled (e.g.
>>    treating this like a counter constraint).
> 
> That's the easiest solution and will keep things simple but downside being
> the sample size will probably get much reduced. But such scenarios will be
> rare, and hence won't matter much.
> 
>>
>> 2) Allow events with conflicting HW filters to be scheduled, merge the
>>    active HW filters, and SW filter the records in the kernel.
>>
>> 3) Allow events with conflicting branch filters to be scheduled, but
>>    only those which match the "active" filter get records.
> 
> That's the proposed solution. "Active" filter gets decided on which event
> comes last thus override the previous and PMU interrupt handler delivers
> branch records only to the matching events.
> 
>>
>> So far (2) seems to me like the best option, and IIUC that's what x86
>> does with LBR. I suspect that also justifies discarding records at
>> context switch time, since we can't know how many relevant records were
>> discarded due to conflicting records (and so cannot safely stitch
>> records), and users have to expect that they may get fewer records than
>> may exist in HW anyway.
> 
> So if we implement merging branch filters requests and SW filtering for the
> captured branch records, we should also drop saving and stitching mechanism
> completely ?
> 
> Coming back to the implementation for option (2), the following code change
> (tentative and untested) will merge branch filter requests and drop the event
> filter check during PMU interrupt.
> 
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index 45ac2d0ca04c..9afa4e48d957 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
> @@ -286,7 +286,13 @@ void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *h
>                 armv8pmu_branch_stack_reset();
>         }
>         hw_events->branch_users++;
> -       hw_events->branch_sample_type = event->attr.branch_sample_type;
> +
> +       /*
> +        * Merge all branch filter requests from different perf
> +        * events being added into this PMU. This includes both
> +        * privilege and branch type filters.
> +        */
> +       hw_events->branch_sample_type |= event->attr.branch_sample_type;
>  }
>  
>  void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events)
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 6137ae4ba7c3..c5311d5365cc 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -856,15 +856,12 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 return;
>  
>         /*
> -        * Overflowed event's branch_sample_type does not match the configured
> -        * branch filters in the BRBE HW. So the captured branch records here
> -        * cannot be co-related to the overflowed event. Report to the user as
> -        * if no branch records have been captured, and flush branch records.
> -        * The same scenario is applicable when the current task context does
> -        * not match with overflown event.
> +        * When the current task context does not match with the PMU overflown
> +        * event, the captured branch records here cannot be co-related to the
> +        * overflowed event. Report to the user as if no branch records have
> +        * been captured, and flush branch records.
>          */
> -       if ((cpuc->branch_sample_type != event->attr.branch_sample_type) ||
> -           (event->ctx->task && cpuc->branch_context != event->ctx))
> +       if (event->ctx->task && cpuc->branch_context != event->ctx)
>                 return;
>  
>         /*
> 
> and something like the following change (tentative and untested) implements the
> required SW branch records filtering.
> 
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index c5311d5365cc..d2390657c466 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
>         armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
>  }
>  
> +static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +       u64 mask = PERF_BR_UNCOND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY)
> +               return true;
> +
> +       if (entry->type == PERF_BR_UNKNOWN)
> +               return true;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
> +               mask |= PERF_BR_IND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_COND) {
> +               mask &= ~PERF_BR_UNCOND;
> +               mask |= PERF_BR_COND;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_CALL)
> +               mask |= PERF_BR_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
> +               mask |= PERF_BR_IND_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> +               mask |= PERF_BR_CALL;
> +               mask |= PERF_BR_IRQ;
> +               mask |= PERF_BR_SYSCALL;
> +               mask |= PERF_BR_SERROR;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_CALL;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
> +               mask |= PERF_BR_RET;
> +               mask |= PERF_BR_ERET;
> +               mask |= PERF_BR_SYSRET;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_RET;
> +       }
> +       return mask & entry->type;
> +}
> +
> +static bool filter_branch_privilege(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_USER)) {
> +               if (is_ttbr0_addr(entry->from) || is_ttbr0_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_KERNEL)) {
> +               if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_HV)) {
> +               if (is_kernel_in_hyp_mode()) {
> +                       if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                               return false;
> +               }
> +       }
> +       return true;
> +}
> +
> +static void filter_branch_records(struct perf_event *event, struct pmu_hw_events *cpuc,
> +                                 struct branch_records *event_records)
> +{
> +       struct perf_branch_entry *entry;
> +       int idx, count;
> +
> +       memset(event_records, 0, sizeof(struct branch_records));
> +       for (idx = 0, count = 0; idx < cpuc->branches->branch_stack.nr; idx++) {
> +               entry = &cpuc->branches->branch_entries[idx];
> +               if (!filter_branch_privilege(event, entry) || !filter_branch_type(event, entry))
> +                       continue;
> +
> +               memcpy(&event_records->branch_entries[count], &entry, sizeof(struct perf_branch_entry));
> +               count++;
> +       }
> +}
> +
>  static void read_branch_records(struct pmu_hw_events *cpuc,
>                                 struct perf_event *event,
>                                 struct perf_sample_data *data,
>                                 bool *branch_captured)
>  {
> +       struct branch_records event_records;
> +
>         /*
>          * CPU specific branch records buffer must have been allocated already
>          * for the hardware records to be captured and processed further.
> @@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 armv8pmu_branch_read(cpuc, event);
>                 *branch_captured = true;
>         }
> +
> +       /*
> +        * Filter captured branch records
> +        *
> +        * PMU captured branch records would contain samples applicable for
> +        * the agregated branch filters, for all events that got scheduled
> +        * on this PMU. Hence the branch records need to be filtered first
> +        * so that each individual event get samples they had requested.
> +        */
> +       if (cpuc->branch_sample_type != event->attr.branch_sample_type) {
> +               filter_branch_records(event, cpuc, &event_records);
> +               perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL);
> +               return;
> +       }
>         perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
>  }

But now with the above SW filtering enabled during branch record processing, there
are two possible problems

- Potential for RCU stalls, observed them some times
- IRQ handling gets delayed while processing these records repeatedly for each event