[RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
Puranjay Mohan
puranjay12 at gmail.com
Mon Jan 5 11:31:19 PST 2026
On Mon, Jan 5, 2026 at 5:49 PM Mark Rutland <mark.rutland at arm.com> wrote:
>
> Hi Puranjay,
>
> I have a number of techincal concerns with this, noted below. At a high
> level, I don't think this makes sense without a better description of
> the usage, and I suspect we'd need changes to bpf_get_branch_snapshot()
> to be able to implement something useful.
Hi Mark,
Thanks for your review.
>
> On Fri, Jan 02, 2026 at 01:40:41PM -0800, Puranjay Mohan wrote:
> > Enable the bpf_get_branch_snapshot() BPF helper on ARM64 by implementing
> > the perf_snapshot_branch_stack static call for ARM's Branch Record
> > Buffer Extension (BRBE).
> >
> > The BPF helper bpf_get_branch_snapshot() allows BPF programs to capture
> > hardware branch records on-demand. This was previously only available on
> > x86 (Intel LBR, AMD BRS) but not on ARM64 despite BRBE being available
> > since ARMv9.
>
> When exactly can bpf_get_branch_snapshot() be called?
>
> How is it expected to be used?
>
> The driver is written on the expectation that BRBE records are only read
> in the PMUv3 overflow handler.
bpf_get_branch_snapshot() can be called from any BPF tracing program
so from unknown context.
>
> > This implementation:
> >
> > - Follows the x86 snapshot pattern (intel_pmu_snapshot_branch_stack)
> > - Performs atomic snapshot by pausing BRBE, reading records, and
> > restoring previous state without disrupting ongoing perf events
>
> If BRBE is stopped, then we lost contiguity of branch records (e.g. for
> a series of branches A->B->C->D->E, we could record A->B->D->E). The
> driver is intended to ensure contiguity of those records, and if that's
> lost, we deliberately discard. That's why brbe_enable() calls
> brbe_invalidate().
>
> Restarting BRBE without discarding will produce bogus data for other
> consumers. If BRBE is stopped, we *must* discard.
>
> > - Reads branch records directly from BRBE registers without
> > event-specific filtering to minimize branch pollution
>
> If BRBE is paused, there's no polution, so this rationale doesn't make
> sense to me.
>
> BRBE can be configured with various filters (specific to the event),
> multiple events with distinct filters can be active simultaneously, and
> if no branch-sampling events are active, BRBE won't record anything. I
> do not think it makes sense to record whatever happens to be present,
> without filtering. Regardless, BRBE won't be recording if you don't have
> events. so this doesn't seem to make sense.
In the expected usage of this bpf helper, perf_events are expected to
be created with the correct filtering.
>From 856c02dbce4f ("bpf: Introduce helper bpf_get_branch_snapshot")
Introduce bpf_get_branch_snapshot(), which allows tracing program to get
branch trace from hardware (e.g. Intel LBR). To use the feature, the
user need to create perf_event with proper branch_record filtering
on each cpu, and then calls bpf_get_branch_snapshot in the bpf function.
On Intel CPUs, VLBR event (raw event 0x1b00) can be use for this.
>
> > - Handles all BRBE record types (complete, source-only, target-only)
> > - Complies with ARM ARM synchronization requirements (ISB barriers per
> > rule PPBZP)
> > - Reuses existing BRBE infrastructure (select_brbe_bank,
> > __read_brbe_regset, brbe_set_perf_entry_type, etc.)
>
> This duplicates other code (e.g. the body of
> brbe_read_filtered_entries()), and I don't think this is justified.
Ack.
>
> >
> > Signed-off-by: Puranjay Mohan <puranjay at kernel.org>
> > ---
> >
> > This patch is only compile tested as I don't have access to hardware with BRBE.
> >
> > ---
> > drivers/perf/arm_brbe.c | 95 ++++++++++++++++++++++++++++++++++++++++
> > drivers/perf/arm_brbe.h | 9 ++++
> > drivers/perf/arm_pmuv3.c | 5 ++-
> > 3 files changed, 108 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> > index ba554e0c846c..cda7bf522c06 100644
> > --- a/drivers/perf/arm_brbe.c
> > +++ b/drivers/perf/arm_brbe.c
> > @@ -803,3 +803,98 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> > done:
> > branch_stack->nr = nr_filtered;
> > }
> > +
> > +/*
> > + * ARM-specific callback invoked through perf_snapshot_branch_stack static
> > + * call, defined in include/linux/perf_event.h. See its definition for API
> > + * details. It's up to caller to provide enough space in *entries* to fit all
> > + * branch records, otherwise returned result will be truncated to *cnt* entries.
> > + *
> > + * This is similar to brbe_read_filtered_entries but optimized for snapshot mode:
> > + * - No filtering based on event attributes (captures everything)
> > + * - Minimal branches to avoid polluting the branch buffer
> > + * - Direct register reads without event-specific processing
> > + */
>
> As above I don't think these differences are necessary nor do I think we
> should be duplicating this.
Ack.
>
> > +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
> > +{
> > + unsigned long flags;
> > + int nr_hw, nr_banks, nr_copied = 0;
> > + u64 brbidr, brbfcr, brbcr;
> > +
> > + /*
> > + * The sequence of steps to freeze BRBE should be completely inlined
> > + * and contain no branches to minimize contamination of branch snapshot.
> > + */
> > + local_irq_save(flags);
>
> The PMU overflow interrupt can be delivered as a pNMI, so this can race
> with that. Disabling interrupts is not sufficient.
In that case this implementation is completely broken, Do you have
suggestions about how this should be implemented?
The use case is to find out how the execution reached a bpf program by
looking at the Branch records. Do you think BRBE can be used to fulfil
this?
>
> If this is only called within a perf event overflow handler, then we
> already have the necessary serialization.
>
> > +
> > + /* Save current BRBE configuration */
> > + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> > + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> > +
> > + /* Pause BRBE to freeze the buffer */
> > + write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> > + isb();
> > +
> > + /* Read BRBIDR to determine number of records */
> > + brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
> > + if (!valid_brbidr(brbidr))
> > + goto out_restore;
> > +
> > + nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
> > + nr_banks = DIV_ROUND_UP(nr_hw, BRBE_BANK_MAX_ENTRIES);
> > +
> > + /* Read branch records from BRBE banks */
> > + for (int bank = 0; bank < nr_banks; bank++) {
> > + int nr_remaining = nr_hw - (bank * BRBE_BANK_MAX_ENTRIES);
> > + int nr_this_bank = min(nr_remaining, BRBE_BANK_MAX_ENTRIES);
> > +
> > + select_brbe_bank(bank);
> > +
> > + for (int i = 0; i < nr_this_bank; i++) {
> > + struct brbe_regset bregs;
> > + struct perf_branch_entry *entry;
> > +
> > + if (nr_copied >= cnt)
> > + goto out_restore;
> > +
> > + if (!__read_brbe_regset(&bregs, i))
> > + goto out_restore;
> > +
> > + entry = &entries[nr_copied];
> > + perf_clear_branch_entry_bitfields(entry);
> > +
> > + /* Simple conversion without filtering */
> > + if (brbe_record_is_complete(bregs.brbinf)) {
> > + entry->from = bregs.brbsrc;
> > + entry->to = bregs.brbtgt;
> > + } else if (brbe_record_is_source_only(bregs.brbinf)) {
> > + entry->from = bregs.brbsrc;
> > + entry->to = 0;
> > + } else if (brbe_record_is_target_only(bregs.brbinf)) {
> > + entry->from = 0;
> > + entry->to = bregs.brbtgt;
> > + }
> > +
> > + brbe_set_perf_entry_type(entry, bregs.brbinf);
> > + entry->cycles = brbinf_get_cycles(bregs.brbinf);
> > +
> > + if (!brbe_record_is_target_only(bregs.brbinf)) {
> > + entry->mispred = brbinf_get_mispredict(bregs.brbinf);
> > + entry->predicted = !entry->mispred;
> > + }
> > +
> > + if (!brbe_record_is_source_only(bregs.brbinf))
> > + entry->priv = brbinf_get_perf_priv(bregs.brbinf);
> > +
> > + nr_copied++;
> > + }
> > + }
> > +
> > +out_restore:
> > + /* Restore BRBE to its previous state */
> > + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> > + isb();
> > + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> > + local_irq_restore(flags);
> > + return nr_copied;
> > +}
>
> As above, this is duplicating a lot of logic we already have. If we
> really need this function, we should be sharing much more logic with
> brbe_read_filtered_entries(), if not using that directly.
>
> Mark.
More information about the linux-arm-kernel
mailing list