[RFC PATCH v1 2/4] drivers/perf: hisi: Add driver support for HiSilicon PMCU

Jonathan Cameron Jonathan.Cameron at Huawei.com
Fri Mar 17 07:52:32 PDT 2023


On Mon, 6 Feb 2023 14:51:44 +0800
Jie Zhan <zhanjie9 at hisilicon.com> wrote:

> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> PMU accesses from CPUs, handling the configuration, event switching, and
> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> scheme may lose events or drop sampling frequency. With PMCU, users can
> reliably obtain the data of up to 240 PMU events with the sample interval
> of events down to 1ms, while the software overhead of accessing PMUs, as
> well as its impact on target workloads, is reduced.
> 
> This driver enables the usage of PMCU through the perf_event framework.
> PMCU is registered as a PMU device and utilises the AUX buffer to dump data
> directly. Users can start PMCU sampling through 'perf-record'. Event
> numbers are passed by a sysfs interface.
> 
> Signed-off-by: Jie Zhan <zhanjie9 at hisilicon.com>

Hi Jie,

A few minor comments inline.
Whilst I looked at this internally, that was a while back so I've
found a few new things to point out in what I think is a pretty good/clean driver.
The main thing here is the RFC questions you've raised in the cover letter
of course - particularly the one around mediating who has the counters between
this and the normal PMU driver.

Thanks,

Jonathan

> ---
>  drivers/perf/hisilicon/Kconfig     |   15 +
>  drivers/perf/hisilicon/Makefile    |    1 +
>  drivers/perf/hisilicon/hisi_pmcu.c | 1096 ++++++++++++++++++++++++++++
>  3 files changed, 1112 insertions(+)
>  create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
> 
> diff --git a/drivers/perf/hisilicon/Kconfig b/drivers/perf/hisilicon/Kconfig
> index 171bfc1b6bc2..d7728fbe8519 100644
> --- a/drivers/perf/hisilicon/Kconfig
> +++ b/drivers/perf/hisilicon/Kconfig
> @@ -24,3 +24,18 @@ config HNS3_PMU
>  	  devices.
>  	  Adds the HNS3 PMU into perf events system for monitoring latency,
>  	  bandwidth etc.
> +
> +config HISI_PMCU
> +	tristate "HiSilicon PMCU"
> +	depends on ARM64 && PID_IN_CONTEXTIDR
> +	help
> +	  Support for HiSilicon Performance Monitor Control Unit (PMCU).
> +	  HiSilicon Performance Monitor Control Unit (PMCU) is a device that
> +	  offloads PMU accesses from CPUs, handling the configuration, event
> +	  switching, and counter reading of core PMUs on Kunpeng SoC. It
> +	  facilitates fine-grained and multi-PMU-event CPU profiling, in which
> +	  scenario the current 'perf' scheme may lose events or drop sampling
> +	  frequency. With PMCU, users can reliably obtain the data of up to 240
> +	  PMU events with the sample interval of events down to 1ms, while the
> +	  software overhead of accessing PMUs, as well as its impact on target
> +	  workloads, is reduced.
> diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
> index 4d2c9abe3372..93e4e6f2816a 100644
> --- a/drivers/perf/hisilicon/Makefile
> +++ b/drivers/perf/hisilicon/Makefile
> @@ -5,3 +5,4 @@ obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o \
>  
>  obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
>  obj-$(CONFIG_HNS3_PMU) += hns3_pmu.o
> +obj-$(CONFIG_HISI_PMCU) += hisi_pmcu.o
> diff --git a/drivers/perf/hisilicon/hisi_pmcu.c b/drivers/perf/hisilicon/hisi_pmcu.c
> new file mode 100644
> index 000000000000..6ec5d6c31e1f
> --- /dev/null
> +++ b/drivers/perf/hisilicon/hisi_pmcu.c
> @@ -0,0 +1,1096 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * HiSilicon Performance Monitor Control Unit (PMCU) driver
> + *
> + * Copyright (C) 2022 HiSilicon Limited
> + * Author: Jie Zhan <zhanjie9 at hisilicon.com>
> + */
> +
> +#include <linux/acpi.h>

Not seeing this used. Probably want mod_devicetable.h that
includes the struct acpi_device_id definition.

> +#include <linux/bitfield.h>
> +#include <linux/bits.h>
> +#include <linux/cpumask.h>
> +#include <linux/delay.h>
> +#include <linux/dev_printk.h>
> +#include <linux/device.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/errno.h>
> +#include <linux/gfp_types.h>

It is very very rare for this to be included directly.
Normally just rely on indirect inclusion from slab.h or similar.
I would drop this one.

> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/mm_types.h>
> +#include <linux/module.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/printk.h>

property.h

> +#include <linux/slab.h>
> +#include <linux/smp.h>
> +#include <linux/threads.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/cputype.h>
> +#include <asm/sysreg.h>
> +
> +/* Registers */
> +#define HISI_PMCU_REG_FSM_STATUS	0x0000
> +#define HISI_PMCU_REG_FSM_CFG		0x0004
> +#define HISI_PMCU_REG_EVENT_BASE_H	0x0008
> +#define HISI_PMCU_REG_EVENT_BASE_L	0x000C
> +#define HISI_PMCU_REG_KILL_BASE_H	0x0010
> +#define HISI_PMCU_REG_KILL_BASE_L	0x0014
> +#define HISI_PMCU_REG_STORE_BASE_H	0x0018
> +#define HISI_PMCU_REG_STORE_BASE_L	0x001C
> +#define HISI_PMCU_REG_WAIT_CNT		0x0020
> +#define HISI_PMCU_REG_FSM_CTRL		0x0038
> +#define HISI_PMCU_REG_FSM_BRK		0x003C
> +#define HISI_PMCU_REG_COMP		0x0044
> +#define HISI_PMCU_REG_INT_EN		0x0100
> +#define HISI_PMCU_REG_INT_MSK		0x0104
> +#define HISI_PMCU_REG_INT_STAT		0x0108
> +#define HISI_PMCU_REG_INT_CLR		0x010C
> +#define HISI_PMCU_REG_PMCR		0x0200
> +#define HISI_PMCU_REG_PMCCFILTR		0x0204
> +
> +/* Register related configs */
> +#define HISI_PMCU_FSM_CFG_EV_LEN_MSK	GENMASK(7, 0)
> +#define HISI_PMCU_FSM_CFG_NR_LOOP_MSK	GENMASK(15, 8)
> +#define HISI_PMCU_FSM_CFG_NR_PMU_MSK	GENMASK(19, 16)
> +#define HISI_PMCU_FSM_CFG_MAX_EV_LEN	240

As this is used in various places that are only loosely assocated
with this register, I'd just rename it HISI_PMCU_MAX_EVN_LEN.
Similar probably applies to some of these others.

> +#define HISI_PMCU_FSM_CFG_MAX_NR_LOOP	255
> +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU	8
> +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU_C	5
> +#define HISI_PMCU_WAIT_CNT_DEFAULT	0x249F0
> +#define HISI_PMCU_FSM_CTRL_TRIGGER	BIT(0)
> +#define HISI_PMCU_FSM_BRK_BRK		BIT(0)
> +#define HISI_PMCU_COMP_HPMN_THR		3
> +#define HISI_PMCU_COMP_ENABLE		BIT(0)
> +#define HISI_PMCU_INT_DONE		BIT(0)
> +#define HISI_PMCU_INT_BRK		BIT(1)
> +#define HISI_PMCU_INT_ALL		GENMASK(1, 0)
> +#define HISI_PMCU_PMCR_DEFAULT		0xC1

How is this related to the architecture defined PMCR register?
Or just a coincidence of naming?

Either way, I'm assuming 0xC1 is probably multiple fields so if
possible can we break this down further with defines to show
where the value comes from.


> +#define HISI_PMCU_PMCCFILTR_MSK		GENMASK(31, 24)
...

> +/**
> + * struct hisi_pmcu_events - PMCU events and sampling configuration
> + * @nr_pmu:		number of core PMU counters that run in parallel
> + * @padding:		number of padding events in a sample
> + * @nr_ev:		number of events passed by users in a sample
> + * @nr_ev_per_sample:	number of events passed to hardware for a sample
> + *			This equals nr_ev + padding and should be evenly
> + *			divisible by nr_pmu.
> + * @max_sample_loop:	max number of samples that can be done in a loop
> + * @ev_len:		event length for hardware to read in a loop
> + * @nr_loop:		number of loops in one trigger
> + * @comp_mode:		compatibility mode
> + * @nr_sample:		number of samples that the current trigger takes
> + * @nr_pending_sample:	number of pending samples
> + * @subsample_size:	size of a subsample
> + * @sample_size:	size of a sample
> + * @output_size:	size of output from one trigger
> + * @sample_period:	sample period passed to hardware
> + * @nr_cpu:		number of hardware threads (logical CPUs)
> + * @events:		event IDs passed from users

Maybe say what they are for rather than where they come from?
event IDs to sample.

> + */
> +struct hisi_pmcu_events {
> +	u8 nr_pmu;
> +	u8 padding;
> +	u8 nr_ev;
> +	u8 nr_ev_per_sample;
> +	u8 max_sample_loop;
> +	u8 ev_len;
> +	u8 nr_loop;
> +	u8 comp_mode;

Could you use the enum hisi_pmcu_comp_mode type for this?

> +	u32 nr_sample;
> +	u32 nr_pending_sample;
> +	u32 subsample_size;
> +	u32 sample_size;
> +	u32 output_size;
> +	u32 sample_period;
> +	u32 nr_cpu;
> +	u32 events[HISI_PMCU_FSM_CFG_MAX_EV_LEN];
> +};
> +

...


> +static const struct attribute_group hisi_pmcu_format_attr_group = {
> +	.name = "format",
> +	.attrs = hisi_pmcu_format_attrs,
> +};
> +
> +static ssize_t monitored_cpus_show(struct device *dev,
> +				   struct device_attribute *attr, char *buf)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
> +
> +	return sysfs_emit(buf, "%d-%d\n",
> +			  cpumask_first(&hisi_pmcu->cpus),
> +			  cpumask_last(&hisi_pmcu->cpus));

What does this do about offline CPUs?
Should it include them or not?

> +}
> +
> +static DEVICE_ATTR_ADMIN_RO(monitored_cpus);



> +static ssize_t user_events_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
> +	struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev;
> +	u32 head, tail, nr_ev;
> +	char *line;
> +	int err;
> +
> +	line = kcalloc(count + 1, sizeof(*line), GFP_KERNEL);

Doesn't seem to be freed anywhere.

> +	nr_ev = 0;
> +	head = 0;
> +	tail = 0;
> +	while (nr_ev < HISI_PMCU_FSM_CFG_MAX_EV_LEN) {
> +		while (head < count && isspace(buf[head]))
> +			head++;
> +		if (!isxdigit(buf[head]))
> +			break;
> +		tail = head + 1;
> +
> +		while (tail < count && isalnum(buf[tail]))
> +			tail++;
> +
> +		strncpy(line, buf + head, tail - head);
> +		line[tail - head] = '\0';
> +		err = kstrtou16(line, 16, &user_ev->ev[nr_ev]);
> +		if (err) {
> +			user_ev->nr_ev = 0;
> +			return err;
> +		}
> +		nr_ev++;
> +		head = tail;
> +	}
> +	user_ev->nr_ev = nr_ev;
> +
> +	return count;
> +}
> +
> +static int hisi_pmcu_pmu_event_init(struct perf_event *event)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
> +	void __iomem *base = hisi_pmcu->regbase;
> +	u64 cfg;
> +	u32 val;
> +
> +	if (event->attr.type != hisi_pmcu->pmu.type)
> +		return -ENOENT;
> +
> +	if (hisi_pmcu->busy)
> +		return -EBUSY;
> +
> +	cfg = event->attr.config;
> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_NR_SAMPLE, cfg);

val gets used for a lot of different things in this function.
I would use as set of new local variables with names that make it more
obvious what they are.

> +	ev->nr_pending_sample = val ? val : HISI_PMCU_PERF_NR_SAMPLE_DEFAULT;
local variable isn't that useful here and makes the whole reuse of value
issue worse.

	ev->nr_pending_sample = FIELD_GET(...);
	if (ev->nr_pending_sample == 0)
		ev->nr_pending_sample = HISI...

> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS, cfg);
> +	if (val > HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS) {
> +		dev_err(hisi_pmcu->dev, "sample period too long (max=0x%x)\n",
> +			HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS);
> +		return -EINVAL;
> +	}
> +	ev->sample_period = val ? val * HISI_PMCU_PERF_MS_TO_WAIT_CNT :
> +				  HISI_PMCU_WAIT_CNT_DEFAULT;
> +
> +	cfg = event->attr.config1;
> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_PMCCFILTR, cfg);
> +	val = FIELD_PREP(HISI_PMCU_PMCCFILTR_MSK, val);
> +	writel(val, base + HISI_PMCU_REG_PMCCFILTR);
> +
> +	return 0;
> +}
> +

...

> +static void hisi_pmcu_hw_sample_start(struct hisi_pmcu *hisi_pmcu,
> +				      struct hisi_pmcu_buf *buf)
> +{
> +	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
> +	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
> +	void __iomem *base = hisi_pmcu->regbase;
> +	u64 addr, end;
> +	u32 val;
> +
> +	/* FSM CFG */
> +	val = FIELD_PREP(HISI_PMCU_FSM_CFG_EV_LEN_MSK, ev->ev_len);
> +	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_LOOP_MSK, ev->nr_loop);
> +	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_PMU_MSK, ev->nr_pmu);
> +	writel(val, base + HISI_PMCU_REG_FSM_CFG);
> +
> +	/* Sample period */
> +	writel(ev->sample_period, base + HISI_PMCU_REG_WAIT_CNT);
> +
> +	/* Event ID base */
> +	addr = virt_to_phys(ev->events);
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_EVENT_BASE_H);

No point in using the local variable val here that I can see.
	writel(upper_32_bits(addr), base + ...)

> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_EVENT_BASE_L);
same with this one.

> +
> +	/* sbuf end */
> +	end = page_to_phys(sbuf->page) + sbuf->size;
> +
> +	/* Data output address */
> +	addr = end - sbuf->remain;
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_STORE_BASE_H);
and this one.

> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_STORE_BASE_L);
and another. etc..

> +
> +	/* Stop data output if sbuf end is reached (abnormally) */
> +	addr = end;
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_KILL_BASE_H);
> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_KILL_BASE_L);
> +
> +	/* Trigger */
> +	writel(HISI_PMCU_FSM_CTRL_TRIGGER, base + HISI_PMCU_REG_FSM_CTRL);
> +}
> +

> +
> +static void hisi_pmcu_write_auxtrace_header(struct hisi_pmcu_events *ev,
> +					    struct hisi_pmcu_buf *buf)
> +{
> +	struct hisi_pmcu_auxtrace_header header;
> +	struct hisi_pmcu_sbuf *sbuf;
> +	u32 *data;
> +	u32 sz;
> +
> +	sbuf = &buf->sbuf[buf->cur_buf];
> +
> +	header.buffer_size = sbuf->size;
> +	header.nr_pmu = ev->nr_pmu;
> +	header.nr_cpu = ev->nr_cpu;
> +	header.comp_mode = ev->comp_mode;
> +	header.subsample_size = ev->subsample_size;
> +	header.nr_subsample_per_sample = ev->nr_ev_per_sample / ev->nr_pmu;
> +	header.nr_event = ev->nr_ev_per_sample;

Might be nicer to read as as

	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
	struct hisi_pmcu_auxtrace_header header = {
		.buffer_size = sbuf->size,
		.nr_pmu = ev->nr_pmu,
		.nr_cpu = ev->nr_cpu,
		...
	};

		
> +
> +	data = page_to_virt(sbuf->page);
> +	memcpy(data, &header, sizeof(header));
> +	memcpy(data + sizeof(header) / sizeof(*data), ev->events,
> +	       ev->nr_ev_per_sample * sizeof(u32));

I'm not sure why data is a u32 *
A few things that would make this neater.

* write the header directly.

	struct hisi_pmcu_auxtrace_header *header = page_to_virt(sbuf->page);

	*header = (struct hisi_pmcu_auxtrace_header) {
		.buffer_size = sbuf->size,
		.nr_pmu = ev->nr_pmu,
		.nr_cpu = ev->nr_cpu,
		...
	};
* Use header + 1 to get to the address just after it.
	memcpy(header + 1, ev->events, ev->nr_ev_per_sample * sizeof(u32));

* Instead add teh data to the structure (maybe rename it)
struct hisi_pmcu_auxtrace_header {
	u32 buffer_size;
	u32 nr_pmu;
	u32 nr_cpu;
	u32 comp_mode;
	u32 subsample_size;
	u32 nr_subsample_per_sample;
	u32 nr_event;
	u32 data[];
};

> +
> +	sz = sizeof(header) + ev->nr_ev_per_sample * sizeof(u32);

with above augmented header structure, struct_size()
 
> +	sz = round_up(sz, HISI_PMCU_AUX_HEADER_ALIGN);
> +
> +	sbuf->remain -= sz;
> +}
> +
> +static void hisi_pmcu_pmu_start(struct perf_event *event, int flags)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct perf_output_handle *handle = &hisi_pmcu->handle;
> +	struct hw_perf_event *hwc = &event->hw;
> +	struct hisi_pmcu_buf *buf;
> +	int err;
> +
> +	spin_lock(&hisi_pmcu->lock);
> +
> +	if (hisi_pmcu->busy) {
> +		dev_info(hisi_pmcu->dev,
> +			 "Sampling is running, pmu->start() ignored\n");

I'm not sure on perf convention on this, but I'd have though dev_dbg
enough for this.  If this is normal thing to do then feel free to leave it.


> +		goto out;
> +	}
> +
...


> +
> +static void hisi_pmcu_pmu_stop(struct perf_event *event, int flags)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct hw_perf_event *hwc = &event->hw;
> +	struct perf_output_handle *handle;
> +	struct hisi_pmcu_sbuf *sbuf;
> +	struct hisi_pmcu_buf *buf;
> +	int err;
> +
> +	spin_lock(&hisi_pmcu->lock);
> +
> +	handle = &hisi_pmcu->handle;
> +
> +	/* If PMCU is running, break it */
> +	if (hisi_pmcu->busy) {
> +		dev_info(hisi_pmcu->dev, "Stopping PMCU sampling\n");

Is this useful?  dev_dbg maybe?

> +		err = hisi_pmcu_hw_sample_stop(hisi_pmcu);
> +		if (err)
> +			dev_err(hisi_pmcu->dev,
> +				"Timed out for stopping PMCU!\n");
> +	}
> +
> +	buf = perf_get_aux(handle);
> +	sbuf = &buf->sbuf[buf->cur_buf];
> +	perf_aux_output_end(handle, sbuf->size - sbuf->remain);
> +
> +	spin_unlock(&hisi_pmcu->lock);
> +
> +	hwc->state |= PERF_HES_STOPPED;
> +	perf_event_update_userpage(event);
> +}

...

> +static int hisi_pmcu_init_data(struct platform_device *pdev,
> +			       struct hisi_pmcu *hisi_pmcu)
> +{
> +	int ret;
> +
> +	hisi_pmcu->regbase = devm_platform_ioremap_resource(pdev, 0);
> +	if (IS_ERR(hisi_pmcu->regbase))
> +		return dev_err_probe(&pdev->dev, -ENODEV,
> +				     "Failed to map device register space\n");
> +
> +	ret = device_property_read_u32(&pdev->dev, "hisilicon,scl-id",
> +				       &hisi_pmcu->scclid);

These need linux/property.h to be included (mentioned above)

> +	if (ret < 0)
> +		return dev_err_probe(&pdev->dev, ret,
> +				     "Failed to read sccl-id!\n");
> +
> +	/*
> +	 * Obtain the number of CPUs that contributes to the sample size.
> +	 * NR_CPU_CLUSTER is now hard coded as the hardware accesses a certain
> +	 * number of CPUs in a cluster regardless of how many CPUs are actually
> +	 * implemented/available.
> +	 */
> +	ret = device_property_read_u32(&pdev->dev, "hisilicon,nr-cluster",
> +				       &hisi_pmcu->ev.nr_cpu);
> +	if (ret < 0)
> +		return dev_err_probe(&pdev->dev, ret,
> +				     "Failed to read nr-cluster!\n");
> +	hisi_pmcu->ev.nr_cpu *= NR_CPU_CLUSTER;
> +
> +	return 0;
> +}




> +static struct platform_driver hisi_pmcu_driver = {
> +	.driver = {
> +		.name = HISI_PMCU_DRV_NAME,
> +		.acpi_match_table = hisi_pmcu_acpi_match,
> +		/*
> +		 * Unbinding driver is not yet supported as we have not worked
> +		 * out a safe bind/unbind process.

I'd add a note on this to the cover letter and the above
patch description.  It's definitely something we should
make work for this driver.

> +		 */
> +		.suppress_bind_attrs = true,
> +	},
> +	.probe = hisi_pmcu_probe,
> +};
> +
..

I'm very interested to see what this hardware/driver gets used for.

Thanks,

Jonathan



More information about the linux-arm-kernel mailing list