RISC-V PMU driver issue

Fri Apr 26 17:07:07 PDT 2024

On 3/18/24 15:18, Atish Patra wrote:
> On 3/12/24 21:29, 雷博涵 wrote:
>> Sorry for the duplicated message. The last one was an HTML e-mail so I 
>> resend this.
>>
>>> 2024年3月13日 08:02，Atish Patra <atishp at atishpatra.org> 写道：
>>>
>>> On Tue, Feb 27, 2024 at 7:49 PM 雷博涵 <garthlei at pku.edu.cn> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am having problems with the RISC-V PMU driver. The overflow 
>>>> handler of my custom perf event kernel counter seems to read an 
>>>> incorrect value from the event.
>>>
>>> If I understand you correctly, you have a custom kernel counter for
>>> the perf event ? Can you explain the use case for that ?
>>
>> Actually it was the perf-event backend of an open-source tool called 
>> PMCTrack (https://github.com/jcsaezal/pmctrack), which I was trying to 
>> port to RISC-V. It creates a kernel counter using 
>> `perf_event_create_kernel_counter`, registering an overflow handler. 
>> The registered overflow handler calls `perf_event_read_value` to read 
>> the counter value.
>>
> 
> Thanks. It makes sense now.
> 
>>>
>>>> It seems that the issue lies in the function 
>>>> `riscv_pmu_event_set_period`, which sets `prev_count` to a new value 
>>>> but does not modify the underlying hardware counter. When 
>>>> `perf_event_read_value` gets called later in the user-defined 
>>>> overflow handler, `riscv_pmu_event_update` will update the `count` 
>>>> field again based on the unmodified hardware counter value and the 
>>>> modified `prev_count` field, which causes an incorrect reading.
>>>
>>> What do you mean by user defined overflow handler ? The overflow
>>> handler registered via the custom perf event kernel counter which gets
>>> invoked from perf_event_overflow ?
>>
>> Yes.
>>
>>>
>>>> I noticed that other PMU drivers, such as the ARM one, write to the 
>>>> underlying counter in their set_period functions, which prevents the 
>>>> problem. However, the RISC-V SBI specification does not have such an 
>>>> API to write to a counter without starting it. Using 
>>>> `local64_read(&hw_evt->period_left) <= 0` directly as the guard 
>>>> condition to place `riscv_pmu_event_set_period(event)` after it 
>>>> seems to work, but I am not sure whether it can cause other issues.
>>>
>>> Not sure the exact code path you are referring to. You don't want to
>>> invoke  riscv_pmu_event_set_period if period_left <= 0 ?
>>
>> First, `pmu_sbi_ovf_handler` calls `riscv_pmu_event_update`, so the 
>> perf event counter value (`event->count`) is updated. It calls 
>> `riscv_pmu_event_set_period` later, which calls 
>> `local64_set(&hwc->prev_count, (u64)-left)`. When 
>> `perf_event_read_value` gets invoked in the registered overflow 
>> handler, `riscv_pmu_event_update` will be invoked eventually. Because 
>> `prev_count` has been set to `(u64)-left` and the underlying hardware 
>> counter value (the value read via `rvpmu->ctr_read(event)`) has not 
>> been changed, `event->count` will be updated again, and the value this 
>> time will be incorrect.
> 
> Got it. This is a genuine issue. I can reproduce it as well with kvm pmu 
> support as it creates a kernel perf event as well.
> 
>>
>> I have found that the ARM PMU does not have the problem. The function 
>> `armpmu_event_set_period` in `arm_pmu.c` calls 
>> `armpmu->write_counter(event, (u64)(-left) & max_period)` after 
>> `local64_set(&hwc->prev_count, (u64)-left)`, so that the underlying 
>> hardware counter is also set to a new value. Thus, when 
>> `armpmu_event_update` gets called later in the registered handler, it 
>> will find that the underlying counter has the same value as 
>> `prev_count`, so `event->count` can keep the correct value.
>>
>> However, as I mentioned before, the RISC-V SBI specification does not 
>> have such an API as “write_counter,” so the ARM solution could not be 
>> adapted easily. I was wondering if we could postpone 
>> `riscv_pmu_event_set_period` after the if-block, like this:
>>
>> ```
>> if (local64_read(&hw_evt->period_left) <= 0) {
>>    perf_event_overflow(event, &data, regs);
>> }
>> riscv_pmu_event_set_period(event);
>> ```
>>
>> This way `perf_event_read_value` can return the correct counter value. 
>> `riscv_pmu_event_set_period` gets invoked after perf_event_overflow. 
>> However, I am not sure if it can cause other problems.
>>
> 
> That seems like a hack. I think we can avoid updating again based on 
> perf state which can indicate that event->count is already updated. 
> Thus, no need to update it again. I will toy around with this idea and 
> get back to you.
> 

Can you try this diff on latest upstream ?

diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
index b4efdddb2ad9..e08529c45c50 100644
--- a/drivers/perf/riscv_pmu.c
+++ b/drivers/perf/riscv_pmu.c
@@ -167,7 +167,7 @@ u64 riscv_pmu_event_update(struct perf_event *event)
         unsigned long cmask;
         u64 oldval, delta;

-       if (!rvpmu->ctr_read)
+       if (!rvpmu->ctr_read || (hwc->state == PERF_HES_UPTODATE))
                 return 0;

         cmask = riscv_pmu_ctr_get_width_mask(event);
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 8cbe6e5f9c39..6ba825a38619 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -763,7 +763,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, 
void *dev)
                  */
                 overflowed_ctrs |= BIT(lidx);
                 hw_evt = &event->hw;
+               /* Update the event states here so that we know the 
state while reading */
+               hw_evt->state |= PERF_HES_STOPPED;
                 riscv_pmu_event_update(event);
+               hw_evt->state |= PERF_HES_UPTODATE;
                 perf_sample_data_init(&data, 0, hw_evt->last_period);
                 if (riscv_pmu_event_set_period(event)) {
                         /*
@@ -776,6 +779,8 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void 
*dev)
                          */
                         perf_event_overflow(event, &data, regs);
                 }
+               /* Reset the state as we are going to start the counter 
after the loop */
+               hw_evt->state = 0;
         }



All the changes in drivers/perf/riscv_pmu_sbi.c are already part of 
ongoing series with snapshot support[1].

I will send the changes in drivers/perf/riscv_pmu.c as a separate fix if 
this works for you.

[1] https://lore.kernel.org/kvm/20240420151741.962500-1-atishp@rivosinc.com/

>>>
>>>>
>>>> Thank you,
>>>> Bohan Lei
>>>
>>>
>>> -- 
>>> Regards,
>>> Atish
>>
>> -- 
>> Regards,
>> Bohan Lei
>>
>> _______________________________________________
>> linux-riscv mailing list
>> linux-riscv at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-riscv
> 
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv