[PATCH] perf: RISC-V: Check standard event availability

Anup Patel anup at brainfault.org
Wed Apr 24 22:46:57 PDT 2024


On Thu, Apr 25, 2024 at 2:36 AM Atish Kumar Patra <atishp at rivosinc.com> wrote:
>
> On Wed, Apr 24, 2024 at 11:27 AM Ian Rogers <irogers at google.com> wrote:
> >
> > On Wed, Apr 24, 2024 at 6:31 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> > >
> > > On Tue, Apr 23, 2024 at 05:36:43PM -0700, Atish Kumar Patra wrote:
> > > > On Mon, Apr 22, 2024 at 8:44 PM Ian Rogers <irogers at google.com> wrote:
> > > > >
> > > > > On Mon, Apr 15, 2024 at 9:07 PM Anup Patel <anup at brainfault.org> wrote:
> > > > > >
> > > > > > On Tue, Apr 16, 2024 at 5:31 AM Atish Patra <atishp at rivosinc.com> wrote:
> > > > > > >
> > > > > > > On 4/10/24 18:40, Samuel Holland wrote:
> > > > > > > > Hi Atish,
> > > > > > > >
> > > > > > > > On 2024-03-18 2:44 PM, Atish Patra wrote:
> > > > > > > >> On Wed, Jan 3, 2024 at 8:54 AM Samuel Holland <samuel.holland at sifive.com> wrote:
> > > > > > > >>>
> > > > > > > >>> The RISC-V SBI PMU specification defines several standard hardware and
> > > > > > > >>> cache events. Currently, all of these events appear in the `perf list`
> > > > > > > >>> output, even if they are not actually implemented. Add logic to check
> > > > > > > >>> which events are supported by the hardware (i.e. can be mapped to some
> > > > > > > >>> counter), so only usable events are reported to userspace.
> > > > > > > >>>
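> > > > > > > >>> The probe itself boils down to one counter-match call per event. A
> > > > > > > >>> minimal sketch of the idea (simplified from the patch; "cmask" here
> > > > > > > >>> stands for the driver's mask of available counters):
> > > > > > > >>>
> > > > > > > >>>   static void pmu_sbi_check_event(struct sbi_pmu_event_data *edata)
> > > > > > > >>>   {
> > > > > > > >>>           struct sbiret ret;
> > > > > > > >>>
> > > > > > > >>>           /* Ask the SBI implementation to match the event to any counter. */
> > > > > > > >>>           ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH,
> > > > > > > >>>                           0, cmask, 0, edata->event_idx, 0, 0);
> > > > > > > >>>           if (!ret.error) {
> > > > > > > >>>                   /* A counter matched; stop and release it again. */
> > > > > > > >>>                   sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP,
> > > > > > > >>>                             ret.value, 0x1, SBI_PMU_STOP_FLAG_RESET,
> > > > > > > >>>                             0, 0, 0);
> > > > > > > >>>           } else if (ret.error == SBI_ERR_NOT_SUPPORTED) {
> > > > > > > >>>                   /* No counter can monitor this event; hide it. */
> > > > > > > >>>                   edata->event_idx = -EINVAL;
> > > > > > > >>>           }
> > > > > > > >>>   }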
> > > > > > > >>
> > > > > > > >> Thanks for the patch.
> > > > > > > >> This adds tons of SBI calls at every boot for a use case which is, at
> > > > > > > >> best, confusing for the subset of users who actually want to run perf.
> > > > > > > >
> > > > > > > > I should have been clearer in the patch description. This is not just a cosmetic
> > > > > > > > change; because perf sees these as valid events, it tries to use them in
> > > > > > > > commands like `perf stat`. When the error from SBI_EXT_PMU_COUNTER_CFG_MATCH
> > > > > > > > causes the ->add() callback to fail, this prevents any other events from being
> > > > > > > > scheduled on that same CPU (search can_add_hw in kernel/events/core.c). That is
> > > > > > > > why the dTLB/iTLB miss counts are missing in the "before" example below.
> > > > > > > >
> > > > > > >
> > > > > > > Thanks for explaining the problem. I can reproduce it in qemu as well if
> > > > > > > enough invalid events are given on the command line and the
> > > > > > > workload is short enough.
> > > > > > >
> > > > > > > >> This probing can be done at runtime by invoking
> > > > > > > >> pmu_sbi_check_event from pmu_sbi_event_map.
> > > > > > > >> We can update the event map after that so that it doesn't need to
> > > > > > > >> invoke pmu_sbi_check_event next time.
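> > > > > > > >> Something along these lines (hypothetical sketch; the "checked"
> > > > > > > >> flag is a new field I am assuming on struct sbi_pmu_event_data):
> > > > > > > >>
> > > > > > > >>   /* Probe an event lazily on first use and cache the result,
> > > > > > > >>    * instead of probing every standard event at boot. */
> > > > > > > >>   static int pmu_sbi_map_hw_event(u64 config, u64 *econfig)
> > > > > > > >>   {
> > > > > > > >>           struct sbi_pmu_event_data *edata = &pmu_hw_event_map[config];
> > > > > > > >>
> > > > > > > >>           if (!edata->checked) {
> > > > > > > >>                   pmu_sbi_check_event(edata);
> > > > > > > >>                   edata->checked = true;
> > > > > > > >>           }
> > > > > > > >>           /* pmu_sbi_check_event() marks unsupported events. */
> > > > > > > >>           if (edata->event_idx == -EINVAL)
> > > > > > > >>                   return -ENOENT;
> > > > > > > >>
> > > > > > > >>           *econfig = edata->event_idx;
> > > > > > > >>           return 0;
> > > > > > > >>   }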
> > > > > > > >
> > > > > > > > I tried to implement this suggestion, but it did not work. The SBI interface
> > > > > > > > does not distinguish between "none of the counters can monitor the specified
> > > > > > > > event [because the event is unsupported]" and "none of the counters can monitor
> > > > > > > > the specified event [because the counters are busy]". It is not sufficient for
> > > > > > > > the kernel to verify that at least one counter is available before performing
> > > > > > > > the check, because certain events may only be usable on a subset of counters
> > > > > > > > (per riscv,event-to-mhpmcounters), and the kernel does not know that mapping.
> > > > > > > >
> > > > > > >
> > > > > > > Yeah. My suggestion was to fix the perf list issue, which is different
> > > > > > > from the issue reported now.
> > > > > > >
> > > > > > > > As a result, checking for event support is only reliable when none of the
> > > > > > > > counters are in use. So the check can be asynchronous/deferred to later in the
> > > > > > > > boot process, but I believe it still needs to be performed for all events before
> > > > > > > > userspace starts using the counters.
> > > > > > > >
> > > > > > >
> > > > > > > We should defer it to a workqueue for sure. We can also look at improving
> > > > > > > the SBI PMU extension to support a bulk matching behavior as well.
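> > > > > > > Something like this, assuming the checks are safe to run from
> > > > > > > process context after the driver has probed:
> > > > > > >
> > > > > > >   /* Run the per-event probing off the boot path. */
> > > > > > >   static void pmu_sbi_check_std_events(struct work_struct *work)
> > > > > > >   {
> > > > > > >           int i;
> > > > > > >
> > > > > > >           for (i = 0; i < ARRAY_SIZE(pmu_hw_event_map); i++)
> > > > > > >                   pmu_sbi_check_event(&pmu_hw_event_map[i]);
> > > > > > >   }
> > > > > > >   static DECLARE_WORK(check_std_events_work, pmu_sbi_check_std_events);
> > > > > > >
> > > > > > >   /* ...and from the driver probe path: */
> > > > > > >   schedule_work(&check_std_events_work);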
> > > > > > >
> > > > > > > However, I think a better solution would be to just rely on the json
> > > > > > > file mappings instead of making SBI calls. We are going to have the
> > > > > > > event encoding and mappings in the json in the future.
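> > > > > > > For reference, a perf tool json event entry would look something
> > > > > > > like this (the event name and code below are made up for
> > > > > > > illustration):
> > > > > > >
> > > > > > >   {
> > > > > > >       "EventName": "L1D_CACHE_MISS",
> > > > > > >       "EventCode": "0x0202",
> > > > > > >       "BriefDescription": "Level 1 data cache miss"
> > > > > > >   }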
> > > > > >
> > > > > > The problem with JSON-based event encoding is how to deal with the case
> > > > > > where we are running inside a Guest/VM, because the Host could be anything.
> > > > > >
> > > > > > IMO, the JSON-based approach is not suitable for the SBI PMU. For now,
> > > > > > we either defer the work using the work queue or keep the approach
> > > > > > of this patch as-is.
> > > > > >
> > > > > > The good part about the SBI PMU extension is that we do have a standard
> > > > > > set of events, and we only need a way to discover the supported standard
> > > > > > events with a minimum number of SBI calls. It is better to add a new
> > > > > > SBI PMU call to assist supported-event discovery, which will also
> > > > > > help us virtualize it.
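> > > > > > For example (purely hypothetical, not in the current SBI spec), a
> > > > > > single call returning a bitmap of supported standard events:
> > > > > >
> > > > > >   /* Made-up function ID for illustration only. */
> > > > > >   #define SBI_EXT_PMU_EVENT_GET_SUPPORTED 0x8
> > > > > >
> > > > > >   static unsigned long pmu_sbi_get_supported_events(void)
> > > > > >   {
> > > > > >           struct sbiret ret;
> > > > > >
> > > > > >           ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_EVENT_GET_SUPPORTED,
> > > > > >                           0, 0, 0, 0, 0, 0);
> > > > > >           /* One ecall yields the whole supported-event bitmap. */
> > > > > >           return ret.error ? 0 : ret.value;
> > > > > >   }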
> > > > > >
> > > > > > Regards,
> > > > > > Anup
> > > > >
> > > > > +Ian Rogers
> > > > >
> > > > > `perf list` will already filter some events depending on whether the
> > > > > PMU supports them, for example, legacy cache events. I think we can
> > > > > extend this to json events.
> > > > >
> > > >
> > > > Yes. That's what I was thinking as well. However, that may be a
> > > > problem in virtualization, as Anup pointed out.
> > > >
> > > > As per my understanding, cloud providers provide json files for VMs
> > > > based on the host architecture and allow migration only between hosts
> > > > with the same family of cpu. In RISC-V, the mapfile.csv works based on
> > > > 3 registers indicating marchid, mimpid, and mvendorid. Thus, the json
> > > > file has to be tied to the host machine it is going to run on.
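> > > > For reference, the riscv mapfile keys on exactly that triplet; an
> > > > entry looks roughly like this (illustrative, quoting the format
> > > > header from memory):
> > > >
> > > >   # MVENDORID-MARCHID-MIMPID,Version,Filename,Type
> > > >   0x489-0x8000000000000007-0x[[:xdigit:]]+,v1,sifive/u74,core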
> > >
> > > This is also my understanding. That is, cloud instances typically
> > > dedicate CPUs to VMs and don't try to present CPU models to VMs which
> > > don't exactly match the host's CPUs. The remaining concern would be if
> > > the hypervisor doesn't emulate/passthrough everything the json describes
> > > for the host CPU type.
> >
>
> x86/arm64 KVM can also filter any events they want for the guest.
>
> > So this isn't accurate. For x86, perf uses the CPUID instruction. A
> > host operating system can change the CPUID for a guest, say pretending
> > a newer CPU model is actually an older one. This can be done when
> > migrating VMs, as having the CPUID change dynamically in a guest would
> > be a problem. VM migration like this can have issues, and it is fair to
> > say that it is avoided.
> >
>
> I was specifically asking whether the json file is updated for a guest
> when it is migrated and the events supported on the destination host
> are different from those on the source host, or whether VM migration
> across different CPUIDs (e.g. different families of CPUs) is avoided
> completely. It seems to be the latter from your statement?
>
>
> > Fwiw, a particular problem we have with x86 guests is the host hiding
> > CPUID leaves that describe things like the frequency of the CPU. It is
> > difficult to compute metrics in units of time when you don't know what
> > frequency 'cycles' relates to.
> >
> > > However, this is just for "typical" clouds. All bets are off for general
> > > virtualization, as Anup points out.
> > >
> > > >
> > > > We will end up doing the same if we only rely on the json file to
> > > > filter events in the future. Please let me know if the assumption is
> > > > incorrect.
> > > >
> > > > If we allow an SBI call route to discover which events are supported,
> > > > the guest can always support legacy events on any host, even if it
> > > > doesn't have a json file.
> > >
> > > Yes, I think we need a solution which works even without a json file,
> > > since a VM may use a special mvendorid,marchid,mimpid triplet to
> > > describe a more generic CPU type. Unless we also create json files
> > > for these VCPUs or provide other event discovery mechanisms,
> > > the VMs will not have anything.
> >
>
> We have to create a json file for VMs anyway for raw events. Having
> the discovery through an SBI call provides a minimum guarantee for
> the perf legacy events.
>
> > I think a set of generic events is a good thing; then the PMU driver
> > can map perf's legacy events to the generic events in a clean way. I'm
> > not sure what the issue is with RISC-V guest operating systems. To
>
> The pertinent question here is how the guest knows the list of
> supported generic or perf legacy events, as RISC-V doesn't have any
> standard event format/encoding support. There are two approaches:
>
> 1. Define a new SBI interface which allows the host to let the guest
> know which events are supported in one shot. The perf legacy event
> mappings are updated at guest boot time via this interface. Currently,
> this patch achieves this by iterating through all the possible legacy
> events and making an SBI call one at a time during boot.
>
> 2. Rely on the json file present (if any) in the guest. In this case,
> all the supported perf legacy events must be present in the json. In
> the absence of that, the driver assumes they are not supported.

I think we should define the SBI interface anyway and let hypervisors
choose their own approach. In other words, if the json file is not present
for a platform (guest/host), then the kernel perf driver can rely on other
means (such as the SBI interface).

>
>
> > lower overhead on x86, pass-through PMUs are being explored, that is,
> > the guest operating system directly programming the CPU's performance
> > counters to avoid hypervisor traps. For this to work, the
> > mvendorid,marchid,mimpid triplet should match that of the host.
> >
>
> Yes. I am tracking the pass-through series for vPMU in x86.
> RISC-V doesn't have pass-through support either; it implements counter
> virtualization similar to other architectures, but we have fewer traps
> because of the bulk access features.
>
> > The perf tool supports the notion of a standard set of legacy events
> > like instructions, cycles and certain cache events. More recently
> > we've moved to prefer sysfs and json events over these, primarily
> > because on the Apple M1 ARM-based Macs nobody was establishing the
> > mapping from legacy to an actual event. Users were complaining that
> > 'cycles' via a sysfs event worked, but when ARM was made like Intel
> > and legacy was prioritized first, we broke it. Now we prioritize the
> > sysfs and json way for all architectures, and hopefully everyone is
> > happy. The last clean-up for this is in:
> > https://lore.kernel.org/lkml/20240416061533.921723-1-irogers@google.com/
> >
> > Thanks,
> > Ian
> >
> > > Thanks,
> > > drew

Regards,
Anup


