[PATCH] perf: RISC-V: Check standard event availability

Mon Apr 15 17:01:08 PDT 2024

On 4/10/24 18:40, Samuel Holland wrote:
> Hi Atish,
> 
> On 2024-03-18 2:44 PM, Atish Patra wrote:
>> On Wed, Jan 3, 2024 at 8:54 AM Samuel Holland <samuel.holland at sifive.com> wrote:
>>>
>>> The RISC-V SBI PMU specification defines several standard hardware and
>>> cache events. Currently, all of these events appear in the `perf list`
>>> output, even if they are not actually implemented. Add logic to check
>>> which events are supported by the hardware (i.e. can be mapped to some
>>> counter), so only usable events are reported to userspace.
>>>
>>
>> Thanks for the patch.
>> This adds tons of SBI calls at every boot for a use case which is at
>> best confusing for a subset of users who actually want to run perf.
> 
> I should have been clearer in the patch description. This is not just a cosmetic
> change; because perf sees these as valid events, it tries to use them in
> commands like `perf stat`. When the error from SBI_EXT_PMU_COUNTER_CFG_MATCH
> causes the ->add() callback to fail, this prevents any other events from being
> scheduled on that same CPU (search can_add_hw in kernel/events/core.c). That is
> why the dTLB/iTLB miss counts are missing in the "before" example below.
> 

Thanks for explaining the problem. I can reproduce it in QEMU as well if
enough invalid events are given on the command line and the workload is
short enough.
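
For anyone following along, here is a rough userspace model of the
can_add_hw behaviour Samuel describes above. It is only a sketch, not the
code in kernel/events/core.c: struct toy_event and toy_sched_in() are
names made up for illustration, and the real scheduler also deals with
event groups, pinned events, and rotation, all of which are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a perf event; not the kernel's struct. */
struct toy_event {
	const char *name;
	bool is_software;   /* software events never need a hardware counter */
	bool hw_supported;  /* false: the PMU's ->add() would fail for this event */
	bool scheduled;     /* set once the event has been put on the CPU */
};

/*
 * Model of one scheduling pass over a list of events. Once a hardware
 * event fails to go on, can_add_hw is cleared and every later hardware
 * event is skipped, even those the hardware could count.
 * Returns the number of events that were scheduled.
 */
static size_t toy_sched_in(struct toy_event *events, size_t n)
{
	bool can_add_hw = true;
	size_t scheduled = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_event *ev = &events[i];

		ev->scheduled = false;
		if (!ev->is_software && !can_add_hw)
			continue;	/* skipped without even trying */

		if (ev->is_software || ev->hw_supported) {
			ev->scheduled = true;
			scheduled++;
		} else {
			/* the failed add blocks the rest of this pass */
			can_add_hw = false;
		}
	}
	return scheduled;
}
```

In this model, a list of cycles, LLC-loads (unsupported), dTLB-load-misses,
and task-clock, in that order, ends up with only cycles and task-clock
scheduled. dTLB-load-misses is never tried, which matches the
"<not counted>" lines in the "before" output.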

>> This probing can be done at runtime by invoking the
>> pmu_sbi_check_event from pmu_sbi_event_map.
>> We can update the event map after that so that it doesn't need to
>> invoke pmu_sbi_check_event next time.
> 
> I tried to implement this suggestion, but it did not work. The SBI interface
> does not distinguish between "none of the counters can monitor the specified
> event [because the event is unsupported]" and "none of the counters can monitor
> the specified event [because the counters are busy]". It is not sufficient for
> the kernel to verify that at least one counter is available before performing
> the check, because certain events may only be usable on a subset of counters
> (per riscv,event-to-mhpmcounters), and the kernel does not know that mapping.
> 

Yeah. My suggestion was to fix the perf list issue, which is different
from the issue reported now.
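
To make the ambiguity concrete, here is a toy userspace model of the
matching behaviour described in the quoted paragraph. Everything in it is
hypothetical (struct toy_pmu, toy_cfg_match(), the counter masks); it does
not call into SBI and only assumes that a single error code covers both
"event not implemented" and "all eligible counters busy".

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ERR_NOT_SUPPORTED	(-2)	/* stand-in for SBI_ERR_NOT_SUPPORTED */
#define TOY_NUM_COUNTERS	4
#define TOY_NUM_EVENTS		3

/* Hypothetical platform model: which counters can count a given event. */
struct toy_pmu {
	uint32_t busy_mask;				/* counters currently in use */
	uint32_t event_ctr_mask[TOY_NUM_EVENTS];	/* 0 means unsupported */
};

/*
 * Toy counterpart of a config-matching call: returns the index of a free
 * counter that can count the event, or TOY_ERR_NOT_SUPPORTED. The error is
 * identical whether the event is unimplemented or its eligible counters
 * are all busy, so the caller cannot tell the two cases apart.
 */
static int toy_cfg_match(const struct toy_pmu *pmu, unsigned int event)
{
	uint32_t candidates = pmu->event_ctr_mask[event] & ~pmu->busy_mask;

	for (int i = 0; i < TOY_NUM_COUNTERS; i++)
		if (candidates & (1u << i))
			return i;
	return TOY_ERR_NOT_SUPPORTED;
}
```

With event 1 restricted to counter 0, probing it while counter 0 is busy
returns the same error as probing an event that no counter implements,
even though counters 1-3 are free. A caller that only checks "is any
counter free" before probing would wrongly mark event 1 as unsupported.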

> As a result, checking for event support is only reliable when none of the
> counters are in use. So the check can be asynchronous/deferred to later in the
> boot process, but I believe it still needs to be performed for all events before
> userspace starts using the counters.
> 

We should defer it to a work queue for sure. We can also look at improving
the SBI PMU extension to support bulk matching.

However, I think a better solution would be to just rely on the json 
file mappings instead of making SBI calls. We are going to have the 
event encoding and mappings in the json in the future.

I had added it only for platforms with counter delegation[1] but I think 
this can be generalized for platforms with SBI PMU as well.

I had some hacks to specify the legacy event encodings, but Ian Rogers
improved on them with generic support that prefers sysfs/json event
encodings over fixed ones[2]. I have yet to rebase and try Ian's series on
top of the counter delegation series, though. Thoughts?

[1] https://lore.kernel.org/lkml/20240217005738.3744121-1-atishp@rivosinc.com/
[2] https://lore.kernel.org/bpf/20240415063626.453987-2-irogers@google.com/T/

> Regards,
> Samuel
> 
>>> Signed-off-by: Samuel Holland <samuel.holland at sifive.com>
>>> ---
>>> Before this patch:
>>> $ perf list hw
>>>
>>> List of pre-defined events (to be used in -e or -M):
>>>
>>>    branch-instructions OR branches                    [Hardware event]
>>>    branch-misses                                      [Hardware event]
>>>    bus-cycles                                         [Hardware event]
>>>    cache-misses                                       [Hardware event]
>>>    cache-references                                   [Hardware event]
>>>    cpu-cycles OR cycles                               [Hardware event]
>>>    instructions                                       [Hardware event]
>>>    ref-cycles                                         [Hardware event]
>>>    stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
>>>    stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
>>>
>>> $ perf stat -ddd true
>>>
>>>   Performance counter stats for 'true':
>>>
>>>                4.36 msec task-clock                       #    0.744 CPUs utilized
>>>                   1      context-switches                 #  229.325 /sec
>>>                   0      cpu-migrations                   #    0.000 /sec
>>>                  38      page-faults                      #    8.714 K/sec
>>>           4,375,694      cycles                           #    1.003 GHz                         (60.64%)
>>>             728,945      instructions                     #    0.17  insn per cycle
>>>              79,199      branches                         #   18.162 M/sec
>>>              17,709      branch-misses                    #   22.36% of all branches
>>>             181,734      L1-dcache-loads                  #   41.676 M/sec
>>>               5,547      L1-dcache-load-misses            #    3.05% of all L1-dcache accesses
>>>       <not counted>      LLC-loads                                                               (0.00%)
>>>       <not counted>      LLC-load-misses                                                         (0.00%)
>>>       <not counted>      L1-icache-loads                                                         (0.00%)
>>>       <not counted>      L1-icache-load-misses                                                   (0.00%)
>>>       <not counted>      dTLB-loads                                                              (0.00%)
>>>       <not counted>      dTLB-load-misses                                                        (0.00%)
>>>       <not counted>      iTLB-loads                                                              (0.00%)
>>>       <not counted>      iTLB-load-misses                                                        (0.00%)
>>>       <not counted>      L1-dcache-prefetches                                                    (0.00%)
>>>       <not counted>      L1-dcache-prefetch-misses                                               (0.00%)
>>>
>>>         0.005860375 seconds time elapsed
>>>
>>>         0.000000000 seconds user
>>>         0.010383000 seconds sys
>>>
>>> After this patch:
>>> $ perf list hw
>>>
>>> List of pre-defined events (to be used in -e or -M):
>>>
>>>    branch-instructions OR branches                    [Hardware event]
>>>    branch-misses                                      [Hardware event]
>>>    cache-misses                                       [Hardware event]
>>>    cache-references                                   [Hardware event]
>>>    cpu-cycles OR cycles                               [Hardware event]
>>>    instructions                                       [Hardware event]
>>>
>>> $ perf stat -ddd true
>>>
>>>   Performance counter stats for 'true':
>>>
>>>                5.16 msec task-clock                       #    0.848 CPUs utilized
>>>                   1      context-switches                 #  193.817 /sec
>>>                   0      cpu-migrations                   #    0.000 /sec
>>>                  37      page-faults                      #    7.171 K/sec
>>>           5,183,625      cycles                           #    1.005 GHz
>>>             961,696      instructions                     #    0.19  insn per cycle
>>>              85,853      branches                         #   16.640 M/sec
>>>              20,462      branch-misses                    #   23.83% of all branches
>>>             243,545      L1-dcache-loads                  #   47.203 M/sec
>>>               5,974      L1-dcache-load-misses            #    2.45% of all L1-dcache accesses
>>>     <not supported>      LLC-loads
>>>     <not supported>      LLC-load-misses
>>>     <not supported>      L1-icache-loads
>>>     <not supported>      L1-icache-load-misses
>>>     <not supported>      dTLB-loads
>>>              19,619      dTLB-load-misses
>>>     <not supported>      iTLB-loads
>>>               6,831      iTLB-load-misses
>>>     <not supported>      L1-dcache-prefetches
>>>     <not supported>      L1-dcache-prefetch-misses
>>>
>>>         0.006085625 seconds time elapsed
>>>
>>>         0.000000000 seconds user
>>>         0.013022000 seconds sys
>>>
>>>
>>>   drivers/perf/riscv_pmu_sbi.c | 37 ++++++++++++++++++++++++++++++++++--
>>>   1 file changed, 35 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
>>> index 16acd4dcdb96..b58a70ee8317 100644
>>> --- a/drivers/perf/riscv_pmu_sbi.c
>>> +++ b/drivers/perf/riscv_pmu_sbi.c
>>> @@ -86,7 +86,7 @@ struct sbi_pmu_event_data {
>>>          };
>>>   };
>>>
>>> -static const struct sbi_pmu_event_data pmu_hw_event_map[] = {
>>> +static struct sbi_pmu_event_data pmu_hw_event_map[] = {
>>>          [PERF_COUNT_HW_CPU_CYCLES]              = {.hw_gen_event = {
>>>                                                          SBI_PMU_HW_CPU_CYCLES,
>>>                                                          SBI_PMU_EVENT_TYPE_HW, 0}},
>>> @@ -120,7 +120,7 @@ static const struct sbi_pmu_event_data pmu_hw_event_map[] = {
>>>   };
>>>
>>>   #define C(x) PERF_COUNT_HW_CACHE_##x
>>> -static const struct sbi_pmu_event_data pmu_cache_event_map[PERF_COUNT_HW_CACHE_MAX]
>>> +static struct sbi_pmu_event_data pmu_cache_event_map[PERF_COUNT_HW_CACHE_MAX]
>>>   [PERF_COUNT_HW_CACHE_OP_MAX]
>>>   [PERF_COUNT_HW_CACHE_RESULT_MAX] = {
>>>          [C(L1D)] = {
>>> @@ -265,6 +265,36 @@ static const struct sbi_pmu_event_data pmu_cache_event_map[PERF_COUNT_HW_CACHE_M
>>>          },
>>>   };
>>>
>>> +static void pmu_sbi_check_event(struct sbi_pmu_event_data *edata)
>>> +{
>>> +       struct sbiret ret;
>>> +
>>> +       ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH,
>>> +                       0, cmask, 0, edata->event_idx, 0, 0);
>>> +       if (!ret.error) {
>>> +               sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP,
>>> +                         ret.value, 0x1, SBI_PMU_STOP_FLAG_RESET, 0, 0, 0);
>>> +       } else if (ret.error == SBI_ERR_NOT_SUPPORTED) {
>>> +               /* This event cannot be monitored by any counter */
>>> +               edata->event_idx = -EINVAL;
>>> +       }
>>> +}
>>> +
>>> +static void pmu_sbi_update_events(void)
>>> +{
>>> +       /* Ensure events are not already mapped to a counter */
>>> +       sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP,
>>> +                 0, cmask, SBI_PMU_STOP_FLAG_RESET, 0, 0, 0);
>>> +
>>> +       for (int i = 0; i < ARRAY_SIZE(pmu_hw_event_map); i++)
>>> +               pmu_sbi_check_event(&pmu_hw_event_map[i]);
>>> +
>>> +       for (int i = 0; i < ARRAY_SIZE(pmu_cache_event_map); i++)
>>> +               for (int j = 0; j < ARRAY_SIZE(pmu_cache_event_map[i]); j++)
>>> +                       for (int k = 0; k < ARRAY_SIZE(pmu_cache_event_map[i][j]); k++)
>>> +                               pmu_sbi_check_event(&pmu_cache_event_map[i][j][k]);
>>> +}
>>> +
>>>   static int pmu_sbi_ctr_get_width(int idx)
>>>   {
>>>          return pmu_ctr_list[idx].width;
>>> @@ -1046,6 +1076,9 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>>>          if (pmu_sbi_get_ctrinfo(num_counters, &cmask))
>>>                  goto out_free;
>>>
>>> +       /* Check which standard events are available */
>>> +       pmu_sbi_update_events();
>>> +
>>>          ret = pmu_sbi_setup_irqs(pmu, pdev);
>>>          if (ret < 0) {
>>>                  pr_info("Perf sampling/filtering is not supported as sscof extension is not available\n");
>>> --
>>> 2.42.0