[PATCH v2 1/7] perf: Increase the maximum number of samples to 256.
Ian Rogers
irogers at google.com
Wed May 21 08:36:09 PDT 2025
On Wed, May 21, 2025 at 3:47 AM Rajnesh Kanwal <rkanwal at rivosinc.com> wrote:
>
> On Thu, Apr 17, 2025 at 1:51 PM Rajnesh Kanwal <rkanwal at rivosinc.com> wrote:
> >
> > + Adding Andi Kleen.
> >
> > On Thu, Feb 20, 2025 at 6:51 PM Ian Rogers <irogers at google.com> wrote:
> > >
> > > On Thu, Jan 16, 2025 at 3:10 PM Rajnesh Kanwal <rkanwal at rivosinc.com> wrote:
> > > >
> > > > The RISC-V CTR extension supports a maximum depth of 256 last branch
> > > > records. The 127-entry limit results in corrupted CTR entries on
> > > > RISC-V when configured for 256 entries. This will not impact any
> > > > other architecture, as it only increases the maximum number of
> > > > possible entries.
> > >
> > > I wonder if rather than a constant this code should just use the auto
> > > resizing hashmap code?
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/hashmap.h
> > >
> > > I assume the value of 127 comes from perf_event.h's PERF_MAX_STACK_DEPTH:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/perf_event.h#n1252
> > >
> > > Perhaps these constants shouldn't exist. The perf-record man page mentions:
> > > sysctl.kernel.perf_event_max_stack
> > > which I believe gets a value from
> > > /proc/sys/kernel/perf_event_max_stack, so maybe these should be
> > > runtime determined constants rather than compile time.
>
> While working on this, I came across the following two patches. It
> looks like what you suggested was tried before, but Arnaldo later
> reverted the change from the report and script commands for the
> reasons mentioned in the second patch.
>
> https://lore.kernel.org/lkml/1461767472-8827-31-git-send-email-acme@kernel.org/
> https://lore.kernel.org/lkml/1463696493-27528-8-git-send-email-acme@kernel.org/
Thanks Rajnesh, agreed on what you found. To resolve the issue,
perhaps we could add a header feature:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/header.h?h=perf-tools-next#n21
for the max stack depth.
Thanks,
Ian
> Regards
> Rajnesh
>
>
> > >
> >
> > Thanks Ian for your feedback. I am not sure it's feasible to use the
> > auto-resizing hashmap here. For each sample of 256 entries we would be
> > doing six callocs and moving a whole lot of entries in hashmap_grow,
> > and since we can't reuse the old hashmap, we would bear the same cost
> > on every sample.
> >
> > I do agree this should be more dynamic, but the maximum number of
> > entries remove_loops can process is limited by the element type of
> > the chash array here. I can change it and the related logic to use
> > uint16_t or wider, but we will still have a cap on the number of
> > entries.
> >
> > PERF_MAX_BRANCH_DEPTH seems to denote how many entries remove_loops
> > can process. It is used by thread__resolve_callchain_sample to check
> > that a sample is processable before calling remove_loops, so I don't
> > think it can be changed to use perf_event_max_stack. But I can rename
> > the macro to avoid confusion.
> >
> > I hadn't noticed PERF_MAX_STACK_DEPTH. It seems to be defined in
> > multiple places and touches BPF as well. I agree that we should avoid
> > using this macro and use a runtime-determined value instead. To be
> > honest, I don't have a super in-depth understanding of perf, but I
> > will give this a try and send a patch in the next update. It would be
> > helpful if you could review it.
> >
> > Thanks
> > -Rajnesh
> >
> > > Thanks,
> > > Ian
> > >
> > > > Signed-off-by: Rajnesh Kanwal <rkanwal at rivosinc.com>
> > > > ---
> > > > tools/perf/util/machine.c | 21 ++++++++++++++-------
> > > > 1 file changed, 14 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > > > index 27d5345d2b30..f2eb3c20274e 100644
> > > > --- a/tools/perf/util/machine.c
> > > > +++ b/tools/perf/util/machine.c
> > > > @@ -2174,25 +2174,32 @@ static void save_iterations(struct iterations *iter,
> > > > iter->cycles += be[i].flags.cycles;
> > > > }
> > > >
> > > > -#define CHASHSZ 127
> > > > -#define CHASHBITS 7
> > > > -#define NO_ENTRY 0xff
> > > > +#define CHASHBITS 8
> > > > +#define NO_ENTRY 0xffU
> > > >
> > > > -#define PERF_MAX_BRANCH_DEPTH 127
> > > > +#define PERF_MAX_BRANCH_DEPTH 256
> > > >
> > > > /* Remove loops. */
> > > > +/* Note: the last entry (i == 0xff) is never compared against NO_ENTRY,
> > > > + * so an unsigned char array is safe for processing 256 entries
> > > > + * without a clash between the last index and the NO_ENTRY value.
> > > > + */
> > > > static int remove_loops(struct branch_entry *l, int nr,
> > > > struct iterations *iter)
> > > > {
> > > > int i, j, off;
> > > > - unsigned char chash[CHASHSZ];
> > > > + unsigned char chash[PERF_MAX_BRANCH_DEPTH];
> > > >
> > > > memset(chash, NO_ENTRY, sizeof(chash));
> > > >
> > > > - BUG_ON(PERF_MAX_BRANCH_DEPTH > 255);
> > > > + BUG_ON(PERF_MAX_BRANCH_DEPTH > 256);
> > > >
> > > > for (i = 0; i < nr; i++) {
> > > > - int h = hash_64(l[i].from, CHASHBITS) % CHASHSZ;
> > > > + /* No remainder division by PERF_MAX_BRANCH_DEPTH is
> > > > + * needed, as hash_64 already limits the hash to
> > > > + * CHASHBITS bits
> > > > + */
> > > > + int h = hash_64(l[i].from, CHASHBITS);
> > > >
> > > > /* no collision handling for now */
> > > > if (chash[h] == NO_ENTRY) {
> > > > --
> > > > 2.34.1
> > > >