[PATCH v2 5/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
Jonathan Cameron
jonathan.cameron at huawei.com
Thu Feb 19 02:10:43 PST 2026
On Wed, 18 Feb 2026 14:58:06 +0000
Besar Wicaksono <bwicaksono at nvidia.com> wrote:
> Add PCIE-TGT PMU support to the Tegra410 SoC. This PMU is
> instantiated in each root complex in the SoC and captures
> traffic originating from any source towards the PCIE BAR and CXL
> HDM ranges. The traffic can be filtered based on the
> destination root port or target address range.
>
> Reviewed-by: Ilkka Koskinen <ilkka at os.amperecomputing.com>
> Signed-off-by: Besar Wicaksono <bwicaksono at nvidia.com>
+CC same group as on previous.
No additional comments from me, I just left the context for those
I +CC.
J
> ---
> .../admin-guide/perf/nvidia-tegra410-pmu.rst | 76 +++++
> drivers/perf/arm_cspmu/nvidia_cspmu.c | 323 ++++++++++++++++++
> 2 files changed, 399 insertions(+)
>
> diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> index 8528685ddb61..07dc447eead7 100644
> --- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> +++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> @@ -7,6 +7,7 @@ metrics like memory bandwidth, latency, and utilization:
>
> * Unified Coherence Fabric (UCF)
> * PCIE
> +* PCIE-TGT
>
> PMU Driver
> ----------
> @@ -211,6 +212,11 @@ Example usage:
>
> perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0180,src_bdf_en=0x1/
>
> +.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:
> +
> +Mapping the RC# to lspci segment number
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
> Designated Vendor Specific Capability (DVSEC) register is added into the PCIE config space
> for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4". The DVSEC register
> @@ -266,3 +272,73 @@ Example output::
> 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
> 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
> 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
> +
> +PCIE-TGT PMU
> +------------
> +
> +The PCIE-TGT PMU monitors traffic targeting PCIE BAR and CXL HDM ranges.
> +There is one PCIE-TGT PMU per PCIE root complex (RC) in the SoC. Each RC in
> +the Tegra410 SoC can have up to 16 lanes, which can be bifurcated into up to
> +8 root ports (RP). The PMU provides an RP filter to count PCIE BAR traffic to
> +each RP, and an address filter to count accesses to PCIE BAR or CXL HDM
> +ranges. The details of the filters are described in the following sections.
> +
> +Mapping the RC# to lspci segment number is similar to the PCIE PMU.
> +Please see :ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.
> +
> +The events and configuration options of this PMU device are available in sysfs,
> +see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.
> +
> +The events in this PMU can be used to measure bandwidth and utilization:
> +
> + * rd_req: count the number of read requests to PCIE.
> + * wr_req: count the number of write requests to PCIE.
> + * rd_bytes: count the number of bytes transferred by rd_req.
> + * wr_bytes: count the number of bytes transferred by wr_req.
> + * cycles: count the number of PCIE cycles.
> +
> +The average bandwidth is calculated as::
> +
> + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
> + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
> +
> +The average request rate is calculated as::
> +
> + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
> + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
> +
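As an aside, the derived metrics above can be sketched in a few lines; the
counter values here are made-up placeholders, not output from a real run:

```python
# Sketch: derive bandwidth and request rate from PCIE-TGT counter deltas.
# The raw counts below are illustrative placeholders, not real perf output.
def avg_bandwidth_gbps(byte_count, elapsed_ns):
    # GB/s equals bytes per nanosecond (1e9 bytes over 1e9 ns).
    return byte_count / elapsed_ns

def avg_request_rate(req_count, cycles):
    # Requests observed per PCIE cycle.
    return req_count / cycles

rd_bytes, elapsed_ns = 8_000_000_000, 1_000_000_000   # 8 GB over 1 s
print(avg_bandwidth_gbps(rd_bytes, elapsed_ns))       # 8.0 GB/s
print(avg_request_rate(125_000_000, 2_000_000_000))   # 0.0625 req/cycle
```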
> +The PMU events can be filtered based on the destination root port or target
> +address range. Filtering based on RP is only available for PCIE BAR traffic.
> +The address filter works for both PCIE BAR and CXL HDM ranges. These filters
> +can be found in sysfs, see
> +/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
> +
> +Destination filter settings:
> +
> +* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
> + corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
> + only available for PCIE BAR traffic.
> +* dst_addr_base: BAR or CXL HDM filter base address.
> +* dst_addr_mask: BAR or CXL HDM filter address mask.
> +* dst_addr_en: enable the BAR or CXL HDM address range filter. If this is set,
> +  the address range specified by "dst_addr_base" and "dst_addr_mask" will be used
> +  to filter the PCIE BAR and CXL HDM traffic addresses. The PMU uses the following
> +  comparison to determine if the traffic destination address falls within the
> +  filter range::
> +
> + (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)
> +
> + If the comparison succeeds, then the event will be counted.
> +
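The masked comparison above can be sketched as a simple predicate (the
addresses below are made-up examples, not taken from real hardware):

```python
def addr_filter_hit(txn_addr, dst_addr_base, dst_addr_mask):
    # A transaction matches when its masked address equals the masked base.
    return (txn_addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)

# With base 0x4000 and mask 0xF000, any address in 0x4000-0x4FFF matches.
print(addr_filter_hit(0x4ABC, 0x4000, 0xF000))  # True
print(addr_filter_hit(0x5000, 0x4000, 0xF000))  # False
```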
> +If the destination filter is not specified, the RP filter will be configured by default
> +to count PCIE BAR traffic to all root ports.
> +
> +Example usage:
> +
> +* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::
> +
> + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/
> +
> +* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
> + 0x10000 to 0x100FF on socket 0's PCIE RC-1::
> +
> + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/
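A quick sanity check on that second example: applying the documented
comparison with base 0x10000 and mask 0xFFF00 does select exactly the
0x10000-0x100FF range (scanning a small window of addresses around it):

```python
# Verify the documented dst_addr_base/dst_addr_mask example from above.
def addr_filter_hit(addr, base=0x10000, mask=0xFFF00):
    return (addr & mask) == (base & mask)

hits = [a for a in range(0x0FF00, 0x10300) if addr_filter_hit(a)]
print(hex(min(hits)), hex(max(hits)), len(hits))  # 0x10000 0x100ff 256
```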
> diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> index 42f11f37bddf..25c408b56dc8 100644
> --- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
> +++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> @@ -42,6 +42,24 @@
> #define NV_PCIE_V2_FILTER2_DST GENMASK_ULL(NV_PCIE_V2_DST_COUNT - 1, 0)
> #define NV_PCIE_V2_FILTER2_DEFAULT NV_PCIE_V2_FILTER2_DST
>
> +#define NV_PCIE_TGT_PORT_COUNT 8ULL
> +#define NV_PCIE_TGT_EV_TYPE_CC 0x4
> +#define NV_PCIE_TGT_EV_TYPE_COUNT 3ULL
> +#define NV_PCIE_TGT_EV_TYPE_MASK GENMASK_ULL(NV_PCIE_TGT_EV_TYPE_COUNT - 1, 0)
> +#define NV_PCIE_TGT_FILTER2_MASK GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT, 0)
> +#define NV_PCIE_TGT_FILTER2_PORT GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT - 1, 0)
> +#define NV_PCIE_TGT_FILTER2_ADDR_EN BIT(NV_PCIE_TGT_PORT_COUNT)
> +#define NV_PCIE_TGT_FILTER2_ADDR GENMASK_ULL(15, NV_PCIE_TGT_PORT_COUNT)
> +#define NV_PCIE_TGT_FILTER2_DEFAULT NV_PCIE_TGT_FILTER2_PORT
> +
> +#define NV_PCIE_TGT_ADDR_COUNT 8ULL
> +#define NV_PCIE_TGT_ADDR_STRIDE 20
> +#define NV_PCIE_TGT_ADDR_CTRL 0xD38
> +#define NV_PCIE_TGT_ADDR_BASE_LO 0xD3C
> +#define NV_PCIE_TGT_ADDR_BASE_HI 0xD40
> +#define NV_PCIE_TGT_ADDR_MASK_LO 0xD44
> +#define NV_PCIE_TGT_ADDR_MASK_HI 0xD48
> +
> #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
>
> #define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
> @@ -186,6 +204,15 @@ static struct attribute *pcie_v2_pmu_event_attrs[] = {
> NULL,
> };
>
> +static struct attribute *pcie_tgt_pmu_event_attrs[] = {
> + ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0),
> + ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1),
> + ARM_CSPMU_EVENT_ATTR(rd_req, 0x2),
> + ARM_CSPMU_EVENT_ATTR(wr_req, 0x3),
> + ARM_CSPMU_EVENT_ATTR(cycles, NV_PCIE_TGT_EV_TYPE_CC),
> + NULL,
> +};
> +
> static struct attribute *generic_pmu_event_attrs[] = {
> ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
> NULL,
> @@ -239,6 +266,15 @@ static struct attribute *pcie_v2_pmu_format_attrs[] = {
> NULL,
> };
>
> +static struct attribute *pcie_tgt_pmu_format_attrs[] = {
> + ARM_CSPMU_FORMAT_ATTR(event, "config:0-2"),
> + ARM_CSPMU_FORMAT_ATTR(dst_rp_mask, "config:3-10"),
> + ARM_CSPMU_FORMAT_ATTR(dst_addr_en, "config:11"),
> + ARM_CSPMU_FORMAT_ATTR(dst_addr_base, "config1:0-63"),
> + ARM_CSPMU_FORMAT_ATTR(dst_addr_mask, "config2:0-63"),
> + NULL,
> +};
> +
> static struct attribute *generic_pmu_format_attrs[] = {
> ARM_CSPMU_FORMAT_EVENT_ATTR,
> ARM_CSPMU_FORMAT_FILTER_ATTR,
> @@ -478,6 +514,267 @@ static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu,
> return 0;
> }
>
> +struct pcie_tgt_addr_filter {
> + u32 refcount;
> + u64 base;
> + u64 mask;
> +};
> +
> +struct pcie_tgt_data {
> + struct pcie_tgt_addr_filter addr_filter[NV_PCIE_TGT_ADDR_COUNT];
> + void __iomem *addr_filter_reg;
> +};
> +
> +#if defined(CONFIG_ACPI)
> +static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
> +{
> + int ret;
> + struct acpi_device *adev;
> + struct pcie_tgt_data *data;
> + struct list_head resource_list;
> + struct resource_entry *rentry;
> + struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
> + struct device *dev = cspmu->dev;
> +
> + data = devm_kzalloc(dev, sizeof(struct pcie_tgt_data), GFP_KERNEL);
> + if (!data)
> + return -ENOMEM;
> +
> + adev = arm_cspmu_acpi_dev_get(cspmu);
> + if (!adev) {
> + dev_err(dev, "failed to get associated PCIE-TGT device\n");
> + return -ENODEV;
> + }
> +
> + INIT_LIST_HEAD(&resource_list);
> + ret = acpi_dev_get_memory_resources(adev, &resource_list);
> + if (ret < 0) {
> + dev_err(dev, "failed to get PCIE-TGT device memory resources\n");
> + acpi_dev_put(adev);
> + return ret;
> + }
> +
> +	rentry = list_first_entry_or_null(
> +			&resource_list, struct resource_entry, node);
> +	if (rentry)
> +		data->addr_filter_reg = devm_ioremap_resource(dev, rentry->res);
> +	else
> +		data->addr_filter_reg = ERR_PTR(-ENODEV);
> +
> +	if (IS_ERR(data->addr_filter_reg)) {
> +		dev_err(dev, "failed to get address filter resource\n");
> +		ret = PTR_ERR(data->addr_filter_reg);
> +	} else {
> +		ret = 0;
> +	}
> +
> + acpi_dev_free_resource_list(&resource_list);
> + acpi_dev_put(adev);
> +
> + ctx->data = data;
> +
> + return ret;
> +}
> +#else
> +static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
> +{
> + return -ENODEV;
> +}
> +#endif
> +
> +static struct pcie_tgt_data *pcie_tgt_get_data(struct arm_cspmu *cspmu)
> +{
> + struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
> +
> + return ctx->data;
> +}
> +
> +/* Find the first available address filter slot. */
> +static int pcie_tgt_find_addr_idx(struct arm_cspmu *cspmu, u64 base, u64 mask,
> + bool is_reset)
> +{
> + int i;
> + struct pcie_tgt_data *data = pcie_tgt_get_data(cspmu);
> +
> + for (i = 0; i < NV_PCIE_TGT_ADDR_COUNT; i++) {
> + if (!is_reset && data->addr_filter[i].refcount == 0)
> + return i;
> +
> + if (data->addr_filter[i].base == base &&
> + data->addr_filter[i].mask == mask)
> + return i;
> + }
> +
> + return -ENODEV;
> +}
> +
> +static u32 pcie_tgt_pmu_event_filter(const struct perf_event *event)
> +{
> + u32 filter;
> +
> + filter = (event->attr.config >> NV_PCIE_TGT_EV_TYPE_COUNT) &
> + NV_PCIE_TGT_FILTER2_MASK;
> +
> + return filter;
> +}
> +
> +static bool pcie_tgt_pmu_addr_en(const struct perf_event *event)
> +{
> + u32 filter = pcie_tgt_pmu_event_filter(event);
> +
> + return FIELD_GET(NV_PCIE_TGT_FILTER2_ADDR_EN, filter) != 0;
> +}
> +
> +static u32 pcie_tgt_pmu_port_filter(const struct perf_event *event)
> +{
> + u32 filter = pcie_tgt_pmu_event_filter(event);
> +
> + return FIELD_GET(NV_PCIE_TGT_FILTER2_PORT, filter);
> +}
> +
> +static u64 pcie_tgt_pmu_dst_addr_base(const struct perf_event *event)
> +{
> + return event->attr.config1;
> +}
> +
> +static u64 pcie_tgt_pmu_dst_addr_mask(const struct perf_event *event)
> +{
> + return event->attr.config2;
> +}
> +
> +static int pcie_tgt_pmu_validate_event(struct arm_cspmu *cspmu,
> + struct perf_event *new_ev)
> +{
> + u64 base, mask;
> + int idx;
> +
> + if (!pcie_tgt_pmu_addr_en(new_ev))
> + return 0;
> +
> + /* Make sure there is a slot available for the address filter. */
> + base = pcie_tgt_pmu_dst_addr_base(new_ev);
> + mask = pcie_tgt_pmu_dst_addr_mask(new_ev);
> + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
> + if (idx < 0)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static void pcie_tgt_pmu_config_addr_filter(struct arm_cspmu *cspmu,
> + bool en, u64 base, u64 mask, int idx)
> +{
> + struct pcie_tgt_data *data;
> + struct pcie_tgt_addr_filter *filter;
> + void __iomem *filter_reg;
> +
> + data = pcie_tgt_get_data(cspmu);
> + filter = &data->addr_filter[idx];
> + filter_reg = data->addr_filter_reg + (idx * NV_PCIE_TGT_ADDR_STRIDE);
> +
> + if (en) {
> + filter->refcount++;
> + if (filter->refcount == 1) {
> + filter->base = base;
> + filter->mask = mask;
> +
> + writel(lower_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
> + writel(upper_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
> + writel(lower_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
> + writel(upper_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
> + writel(1, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
> + }
> + } else {
> + filter->refcount--;
> + if (filter->refcount == 0) {
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
> +
> + filter->base = 0;
> + filter->mask = 0;
> + }
> + }
> +}
> +
> +static void pcie_tgt_pmu_set_ev_filter(struct arm_cspmu *cspmu,
> + const struct perf_event *event)
> +{
> + bool addr_filter_en;
> + int idx;
> + u32 filter2_val, filter2_offset, port_filter;
> + u64 base, mask;
> +
> + filter2_val = 0;
> + filter2_offset = PMEVFILT2R + (4 * event->hw.idx);
> +
> + addr_filter_en = pcie_tgt_pmu_addr_en(event);
> + if (addr_filter_en) {
> + base = pcie_tgt_pmu_dst_addr_base(event);
> + mask = pcie_tgt_pmu_dst_addr_mask(event);
> + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
> +
> + if (idx < 0) {
> + dev_err(cspmu->dev,
> + "Unable to find a slot for address filtering\n");
> + writel(0, cspmu->base0 + filter2_offset);
> + return;
> + }
> +
> +		/* Configure address range filter registers. */
> +		pcie_tgt_pmu_config_addr_filter(cspmu, true, base, mask, idx);
> +
> +		/* Configure the counter to use the selected address filter slot. */
> + filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_ADDR, 1U << idx);
> + }
> +
> + port_filter = pcie_tgt_pmu_port_filter(event);
> +
> + /* Monitor all ports if no filter is selected. */
> + if (!addr_filter_en && port_filter == 0)
> + port_filter = NV_PCIE_TGT_FILTER2_PORT;
> +
> + filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_PORT, port_filter);
> +
> + writel(filter2_val, cspmu->base0 + filter2_offset);
> +}
> +
> +static void pcie_tgt_pmu_reset_ev_filter(struct arm_cspmu *cspmu,
> + const struct perf_event *event)
> +{
> + bool addr_filter_en;
> + u64 base, mask;
> + int idx;
> +
> + addr_filter_en = pcie_tgt_pmu_addr_en(event);
> + if (!addr_filter_en)
> + return;
> +
> + base = pcie_tgt_pmu_dst_addr_base(event);
> + mask = pcie_tgt_pmu_dst_addr_mask(event);
> + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, true);
> +
> + if (idx < 0) {
> + dev_err(cspmu->dev,
> + "Unable to find the address filter slot to reset\n");
> + return;
> + }
> +
> + pcie_tgt_pmu_config_addr_filter(cspmu, false, base, mask, idx);
> +}
> +
> +static u32 pcie_tgt_pmu_event_type(const struct perf_event *event)
> +{
> + return event->attr.config & NV_PCIE_TGT_EV_TYPE_MASK;
> +}
> +
> +static bool pcie_tgt_pmu_is_cycle_counter_event(const struct perf_event *event)
> +{
> + u32 event_type = pcie_tgt_pmu_event_type(event);
> +
> + return event_type == NV_PCIE_TGT_EV_TYPE_CC;
> +}
> +
> enum nv_cspmu_name_fmt {
> NAME_FMT_GENERIC,
> NAME_FMT_SOCKET,
> @@ -622,6 +919,30 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> .reset_ev_filter = nv_cspmu_reset_ev_filter,
> }
> },
> + {
> + .prodid = 0x10700000,
> + .prodid_mask = NV_PRODID_MASK,
> + .name_pattern = "nvidia_pcie_tgt_pmu_%u_rc_%u",
> + .name_fmt = NAME_FMT_SOCKET_INST,
> + .template_ctx = {
> + .event_attr = pcie_tgt_pmu_event_attrs,
> + .format_attr = pcie_tgt_pmu_format_attrs,
> + .filter_mask = 0x0,
> + .filter_default_val = 0x0,
> + .filter2_mask = NV_PCIE_TGT_FILTER2_MASK,
> + .filter2_default_val = NV_PCIE_TGT_FILTER2_DEFAULT,
> + .get_filter = NULL,
> + .get_filter2 = NULL,
> + .init_data = pcie_tgt_init_data
> + },
> + .ops = {
> + .is_cycle_counter_event = pcie_tgt_pmu_is_cycle_counter_event,
> + .event_type = pcie_tgt_pmu_event_type,
> + .validate_event = pcie_tgt_pmu_validate_event,
> + .set_ev_filter = pcie_tgt_pmu_set_ev_filter,
> + .reset_ev_filter = pcie_tgt_pmu_reset_ev_filter,
> + }
> + },
> {
> .prodid = 0,
> .prodid_mask = 0,
> @@ -714,6 +1035,8 @@ static int nv_cspmu_init_ops(struct arm_cspmu *cspmu)
>
> /* NVIDIA specific callbacks. */
> SET_OP(validate_event, impl_ops, match, NULL);
> + SET_OP(event_type, impl_ops, match, NULL);
> + SET_OP(is_cycle_counter_event, impl_ops, match, NULL);
> SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter);
> SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter);
> SET_OP(reset_ev_filter, impl_ops, match, NULL);