[PATCH V2] perf: qcom: Add L3 cache PMU driver

Fri Feb 17 10:52:44 PST 2017

On Thu, Feb 16, 2017 at 05:01:19PM -0500, Agustin Vega-Frias wrote:
> On 2017-02-16 11:41, Mark Rutland wrote:
> >On Thu, Feb 16, 2017 at 10:03:05AM -0500, Agustin Vega-Frias wrote:
> >>This adds a new dynamic PMU to the Perf Events framework to program
> >>and control the L3 cache PMUs in some Qualcomm Technologies SOCs.

> >At a quick glance, this looks *awfully* similar to the L2 PMU, modulo
> >using MMIO rather than sysreg accesses. The basic shape of the hardware
> >seems the same, and the register offsets are practically identical.
> >
> >I guess that the L2 and L3 are largely the same block?
> >
> >How exactly does this relate to the L2 PMU hardware? Could you please
> >elaborate on any major differences?
> 
> The L2 and L3 blocks are separate. In the current SoC each internal
> cluster shares an L2, but all clusters in the SoC share all the L3
> slices. Logically it is seen as a single, larger L3 cache, shared by
> all CPUs in the system. In spite of this, each physical slice has its
> own PMU. Which slice serves a given memory access is determined by
> a hashing algorithm which means all CPUs might have cache lines on
> every slice.
> 
> >This has similar issues to earlier versions of the L2 PMU driver, such
> >as permitting cross-{slice,pmu} groups, and aggregating per-slice
> >events, which have been addressed in the upstreamed L2 PMU driver.
> 
> We represent this as a single perf PMU that can only be associated
> with a sigle CPU context. We do aggregation here, since logically it
> is a single L3 cache.

Collating the PMUs behind a single logical PMU is fine.

I'm specifically referring to collating events (i.e. hiding separately
controlled counters behind a single perf_event).

Since the L2 slices were cluster-affine, we could manage those on a
per-cpu basis. Here you'd either have to have a config field to select
the slice, or expose each slice as its own PMU.

> >>+static void qcom_l3_cache__64bit_counter_start(struct
> >>perf_event *event)
> >>+{
> >>+	struct l3cache_pmu *socket = to_l3cache_pmu(event->pmu);
> >>+	struct hml3_pmu *slice;
> >>+	int idx = event->hw.idx;
> >>+	u64 value = local64_read(&event->count);
> >>+
> >>+	list_for_each_entry(slice, &socket->pmus, entry) {
> >>+		hml3_pmu__counter_enable_gang(slice, idx+1);
> >>+
> >>+		if (value) {
> >>+			hml3_pmu__counter_set_value(slice, idx+1, value >> 32);
> >>+			hml3_pmu__counter_set_value(slice, idx, (u32)value);
> >>+			value = 0;
> >>+		} else {
> >>+			hml3_pmu__counter_set_value(slice, idx+1, 0);
> >>+			hml3_pmu__counter_set_value(slice, idx, 0);
> >>+		}
> >>+
> >>+		hml3_pmu__counter_set_event(slice, idx+1, 0);
> >>+		hml3_pmu__counter_set_event(slice, idx, get_event_type(event));
> >>+
> >>+		hml3_pmu__counter_enable(slice, idx+1);
> >>+		hml3_pmu__counter_enable(slice, idx);
> >>+	}
> >>+}
> >
> >As with other PMU drivers, NAK to aggregating (separately-controlled)
> >HW events behind a single event. We should not be aggregating across
> >slices as we cannot start/stop those atomically.
> 
> In this case we start the hardware events in all slices. IOW there is a
> one-to-many relationship between perf_events in this perf PMU and events
> in the hardware PMUs.

I understand what is happening here; I'm simply disagreeing with it. We
do not permit event aggregation in this manner. Please see [1].

[...]

> >>+PMU_FORMAT_ATTR(event, "config:0-7");
> >>+PMU_FORMAT_ATTR(lc, "config:32");
> >
> >What is this 'lc' field?
> 
> It means "long counter". We have a feature that allows chaining two
> 32 bit
> hardware counters. This is how a user requests that.

Ok. This will need documentation.

[...]

> >>+static int qcom_l3_cache_pmu_offline_cpu(unsigned int cpu,
> >>struct hlist_node *n)
> >>+{
> >>+	struct l3cache_pmu *socket = hlist_entry_safe(n, struct
> >>l3cache_pmu, node);
> >>+	unsigned int target;
> >>+
> >>+	if (!cpumask_test_and_clear_cpu(cpu, &socket->cpu))
> >>+		return 0;
> >>+	target = cpumask_any_but(cpu_online_mask, cpu);
> >>+	if (target >= nr_cpu_ids)
> >>+		return 0;
> >>+	/*
> >>+	 * TODO: migrate context once core races on event->ctx have
> >>+	 * been fixed.
> >>+	 */
> >>+	cpumask_set_cpu(target, &socket->cpu);
> >>+	return 0;
> >>+}
> >
> >The event->ctx race has been fixed for a while now.
> 
> Great, I was searching for that yesterday, but could not find
> anything and
> assumed it had not been fixed, given that the CCI driver still has this
> comment in it. So it is safe to add the call to
> perf_pmu_migrate_context now?
> >
> >Please follow the example of the L2 PMU's hotplug callback, and also
> >fold the reset into the hotplug callback.
> 
> I agree we will need to do that once we have multi-socket support, but
> would you agree to keep this as-is for the time being?

If you treat the PMU as being affine to cpu_possible_mask you can follow
the same strategy as the L2 PMU driver. Even if it's not strictly
necessary, it avoids a needless difference between the two drivers.

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/478424.html