[PATCH v2 2/3] perf: Add driver for Arm NI-700 interconnect PMU
Robin Murphy
robin.murphy at arm.com
Fri Aug 16 11:05:45 PDT 2024
On 15/08/2024 6:04 pm, Mark Rutland wrote:
> On Wed, Jul 10, 2024 at 05:09:34PM +0100, Robin Murphy wrote:
>> The Arm NI-700 Network-on-Chip Interconnect has a relatively
>> straightforward design with a hierarchy of voltage, power, and clock
>> domains, where each clock domain then contains a number of interface
>> units and a PMU which can monitor events thereon. As such, it begets a
>> relatively straightforward driver to interface those PMUs with perf.
>>
>> Even more so than with arm-cmn, users will require detailed knowledge of
>> the wider system topology in order to meaningfully analyse anything,
>> since the interconnect itself cannot know what lies beyond the boundary
>> of each inscrutably-numbered interface. Given that, for now they are
>> also expected to refer to the NI-700 documentation for the relevant
>> event IDs to provide as well. An identifier is implemented so we can
>> come back and add jevents if anyone really wants to.
>
> IIUC the relevant documentation would be the NI-700 TRM:
>
> https://developer.arm.com/documentation/101566/0203/?lang=en
>
> ... right?
That's the one.
> [...]
>
>> +====================================
>> +Arm Network-on Chip Interconnect PMU
>> +====================================
>> +
>> +NI-700 and friends implement a distinct PMU for each clock domain within the
>> +interconnect. Correspondingly, the driver exposes multiple PMU devices named
>> +arm_ni_<x>_cd_<y>, where <x> is an (arbitrary) instance identifier and <y> is
>> +the clock domain ID within that particular instance. If multiple NI instances
>> +exist within a system, the PMU devices can be correlated with the underlying
>> +hardware instance via sysfs parentage.
>
> I suspect that name suffixing *might* confuse userspace in some cases,
> since IIUC the perf tool tries to aggregate some_pmu_<number> instances,
> and here it would presumably aggregate all the clock domains as a
> arm_ni_<x>_cd PMU.
Indeed that is liable to be particularly funky since different domains
may well contain different sets of interface types and thus not even
necessarily share the same set of base events. Either way, events are
targeted at specific interfaces so aggregation at any level would
generally be meaningless.
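(For reference, the names in question would come from something along the lines of the sketch below when each clock domain PMU is registered - the instance and domain ID fields here are illustrative rather than lifted from the patch:

	name = devm_kasprintf(ni->dev, GFP_KERNEL, "arm_ni_%d_cd_%d",
			      ni->id, node->id);
	if (!name)
		return -ENOMEM;

so a system with one NI instance containing two clock domains would expose arm_ni_0_cd_0 and arm_ni_0_cd_1, and any tooling which merges PMUs differing only by a trailing number would see them both as "arm_ni_0_cd".)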
> We should check how that behaves and speak to the userspace folk about
> this; taking a step back we probably need a better way for determining
> when things can or should be aggregated.
>
> I assume we don't have HW to test on, but we should be able to fake up
> the sysfs hierarchy and see how "perf list" treats that.
I see there is an available FVP for the Total Compute TC2 system[1]
which includes an NI-700 instance - it doesn't model the PMU counters,
but the rest should work enough for discovery. I might have a go with
that next week...
> [...]
>
>> +static int arm_ni_validate_group(struct perf_event *event)
>> +{
>> +	struct perf_event *sibling, *leader;
>> +	struct arm_ni_val val = { 0 };
>> +
>> +	arm_ni_validate_event(event, &val);
>> +
>> +	leader = event->group_leader;
>> +	if (leader != event && !arm_ni_validate_event(leader, &val))
>> +		return - EINVAL;
>> +
>> +	for_each_sibling_event(sibling, leader) {
>> +		if (!arm_ni_validate_event(sibling, &val))
>> +			return - EINVAL;
>> +	}
>> +	return 0;
>> +}
>
> As a trivial nit, something has gone wrong with spacing and you have:
>
> return - EINVAL;
>
> ... rather than:
>
> return -EINVAL;
>
> As a more substantial thing, this will trigger splats when lockdep is
> enabled and an event is opened where event == event->group_leader,
> because for_each_sibling_event(..., event) checks event->ctx->mutex is
> held, and event->ctx isn't initialised until pmu::event_init() returns.
Oof, I even remember looking at that report, so I don't really have any
excuse!
> That's a latent bug in many PMU drivers at the moment, and is on my TODO
> list to fix elsewhere. For now, can you make the above:
>
> | static int arm_ni_validate_group(struct perf_event *event)
> | {
> |	struct perf_event *sibling, *leader = event->group_leader;
> |	struct arm_ni_val val = { 0 };
> |
> |	if (event == leader)
> |		return 0;
> |
> |	arm_ni_validate_event(event, &val);
> |
> |	if (!arm_ni_validate_event(leader, &val))
> |		return -EINVAL;
> |
> |	for_each_sibling_event(sibling, leader) {
> |		if (!arm_ni_validate_event(sibling, &val))
> |			return -EINVAL;
> |	}
> |
> |	return 0;
> | }
>
> ... where the early exit for the (event == leader) case will avoid the
> bad call.
Took me a moment to see why it's OK to skip "validating" the event when it is itself the group leader, but that's arguably just as much down to my naming. I'll polish that up a bit further.
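(For anyone else puzzling over it: the validation is essentially just counting how many counters the group as a whole would need, so a group consisting of only the leader can never overcommit anything. A rough sketch of the shape of it, with names and limits that are illustrative rather than exactly what's in the patch:

	struct arm_ni_val {
		unsigned int evcnt;	/* programmable event counters claimed */
		unsigned int ccnt;	/* cycle counter claimed */
	};

	static bool arm_ni_validate_event(struct perf_event *event,
					  struct arm_ni_val *val)
	{
		if (is_cycle_event(event))	/* hypothetical helper */
			return !val->ccnt++;
		return ++val->evcnt <= NI_NUM_COUNTERS;	/* hypothetical limit */
	}

Hence the early return for event == leader is safe.)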
> [...]
>
>> +static int arm_ni_init_cd(struct arm_ni *ni, struct arm_ni_node *node, u64 res_start)
>> +{
>> +	struct arm_ni_cd *cd = ni->cds + node->id;
>> +	const char *name;
>> +	int err;
>
>> +	cd->cpu = cpumask_local_spread(0, dev_to_node(ni->dev));
>
> Can dev_to_node(ni->dev) return NUMA_NO_NODE, and if so, do we need to
> error out here? ...
It can, and cpumask_local_spread() just picks from cpu_online_mask in
that case. It's not an error either way; the intent is simply that if we
do have NUMA, it makes sense to try to associate with a physically
close CPU.
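In other words, the probe-time assignment degrades gracefully - a commented sketch of what the quoted line relies on (assuming at least one CPU is online, which is a given on the probe path):

	int node = dev_to_node(ni->dev);	/* may be NUMA_NO_NODE */

	/*
	 * With a real node ID this picks the nearest online CPU;
	 * with NUMA_NO_NODE it just indexes cpu_online_mask, so
	 * either way we end up with a valid CPU to bind the PMU to.
	 */
	cd->cpu = cpumask_local_spread(0, node);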
>> +static int arm_ni_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>> +{
>> +	struct arm_ni_cd *cd;
>> +	int node;
>> +
>> +	cd = hlist_entry_safe(cpuhp_node, struct arm_ni_cd, cpuhp_node);
>> +	node = dev_to_node(cd_to_ni(cd)->dev);
>> +	if (node != NUMA_NO_NODE && cpu_to_node(cd->cpu) != node && cpu_to_node(cpu) == node)
>> +		arm_ni_pmu_migrate(cd, cpu);
>> +	return 0;
>> +}
>
> ... since we expect to handle similar here ...
...and conversely, without NUMA there's no point even asking the
question of whether a new CPU might be more local than our current one.
>> +
>> +static int arm_ni_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>> +{
>> +	struct arm_ni_cd *cd;
>> +	unsigned int target;
>> +	int node;
>> +
>> +	cd = hlist_entry_safe(cpuhp_node, struct arm_ni_cd, cpuhp_node);
>> +	if (cpu != cd->cpu)
>> +		return 0;
>> +
>> +	node = dev_to_node(cd_to_ni(cd)->dev);
>> +	target = cpumask_any_and_but(cpumask_of_node(node), cpu_online_mask, cpu);
>> +	if (target >= nr_cpu_ids)
>> +		target = cpumask_any_but(cpu_online_mask, cpu);
>> +
>> +	if (target < nr_cpu_ids)
>> +		arm_ni_pmu_migrate(cd, target);
>> +	return 0;
>> +}
>
> ... though not here, and overall that seems inconsistent.
I did start down that rabbit hole, but gave up in despair around the
point of the "#ifdef CONFIG_NUMA" inside the "#ifndef CONFIG_NUMA". My
conclusion was that the majority of cpumask_of_node() implementations do
handle NUMA_NO_NODE, which makes that look like the expected behaviour,
and there is already other code relying on it, so the implementations
which don't were not my problem.
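(For the record, if we really didn't want to trust cpumask_of_node(NUMA_NO_NODE) everywhere, the offline path could grow an explicit guard along these lines - shown purely as a sketch of the alternative, not something I'm proposing:

	node = dev_to_node(cd_to_ni(cd)->dev);
	if (node != NUMA_NO_NODE)
		target = cpumask_any_and_but(cpumask_of_node(node),
					     cpu_online_mask, cpu);
	else
		target = nr_cpu_ids;
	if (target >= nr_cpu_ids)
		target = cpumask_any_but(cpu_online_mask, cpu);

...but given the above, relying on cpumask_of_node() to do the sensible thing seems fair.)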
Thanks,
Robin.