Perf Event support for ARMv7 (was: Re: [PATCH 5/5] arm/perfevents: implement perf event support for ARMv6)

Will Deacon will.deacon at arm.com
Mon Jan 4 11:52:01 EST 2010


Hi Jean,

Thanks for this. Feedback inline.

* Jean Pihet wrote:

> Here is the updated patch after testing on HW.
> I will rebase it on Jamie's latest patch set as soon as it is out.
> 
> Feedback is welcome!

> Some remarks and questions:
> 
> 1) The number of available counters can reach 32 on ARMv7, so the macro
> ARMPMU_MAX_HWEVENTS is now defined as 32. Is that correct?

I think index 0 of the events array is reserved, so this should probably be 33.
Hopefully that doesn't break any of the bitmasks.
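
Something like this is what I have in mind (a sketch only, assuming the
counters stay 1-indexed as they are in the v6 code):

	/*
	 * Counters are 1-indexed, so slot 0 of the events array is
	 * never used and 32 counters need one extra entry.
	 */
	#define ARMPMU_MAX_HWEVENTS	33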

> 2) Please note that the Cortex-A9 events do not easily map to the predefined
> events. Cf. armv7_a9_perf_map and armv7_a9_perf_cache_map in the code.
> - the PERF_COUNT_HW_INSTRUCTIONS event is not found. It looks like the
> instruction count has to be derived by summing several event counts
> (events 0x70 to 0x74: MAIN_UNIT_EXECUTED_INST, SECOND_UNIT_EXECUTED_INST,
> LD_ST_UNIT_EXECUTED_INST, FP_EXECUTED_INST and NEON_EXECUTED_INST),

Event 0x68 - `Instructions coming out of the core renaming stage'
can be used as an approximation for PERF_COUNT_HW_INSTRUCTIONS.
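
In map terms, something along these lines in armv7_a9_perf_map (the
enum name is my invention, it isn't in your patch):

	[PERF_COUNT_HW_INSTRUCTIONS] = ARMV7_PERFCTR_INST_OUT_OF_RENAME_STAGE,	/* 0x68 */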

> - the HW_BRANCH events are not found

0x0c is HW_BRANCH_INSTRUCTIONS and 0x10 is HW_BRANCH_MISSES.
0x12 is the number of predictable branch instructions executed, so the mispredict
rate is 0x10/0x12. These events are defined for v7, so A8 should take these
definitions too.
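
Using the enum names from your patch, the map entries would be:

	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS]	= ARMV7_PERFCTR_PC_WRITE,		/* 0x0c */
	[PERF_COUNT_HW_BRANCH_MISSES]		= ARMV7_PERFCTR_PC_BRANCH_MIS_PRED,	/* 0x10 */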

> - the global cache events 0x50 and 0x51 define the COHERENT_LINE_HIT and
> COHERENT_LINE_MISS events, is that correct?

0x50 is COHERENT_LINE_MISS and 0x51 is COHERENT_LINE_HIT. These are only available
on A9 with SMP.
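
i.e. the enum entries should read:

	ARMV7_PERFCTR_COHERENT_LINE_MISS	= 0x50,
	ARMV7_PERFCTR_COHERENT_LINE_HIT		= 0x51,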

> - L1 and L2 cache events are not found. Those could be available in separate
> PL310 registers, TBC

We could use 0x01 for icache miss, 0x03 for dcache miss and 0x04 for dcache access.
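
In terms of the common v7 enum in your patch, that's:

	ARMV7_PERFCTR_IFETCH_MISS	= 0x01,	/* icache miss	 */
	ARMV7_PERFCTR_DCACHE_REFILL	= 0x03,	/* dcache miss	 */
	ARMV7_PERFCTR_DCACHE_ACCESS	= 0x04,	/* dcache access */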

> - no TLB events except the ITLB_MISS event are found.

I think 0x05 is DTLB_MISS.
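
That would give a DTLB entry along these lines (reads and writes
combined, as with the caches):

	[C(DTLB)] = {
		[C(OP_READ)] = {
			[C(RESULT_ACCESS)]	= CACHE_OP_UNSUPPORTED,
			[C(RESULT_MISS)]	= ARMV7_PERFCTR_DTLB_REFILL,	/* 0x05 */
		},
		[C(OP_WRITE)] = {
			[C(RESULT_ACCESS)]	= CACHE_OP_UNSUPPORTED,
			[C(RESULT_MISS)]	= ARMV7_PERFCTR_DTLB_REFILL,	/* 0x05 */
		},
	},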

<snip>

> +/*
> + * ARMv7 Cortex-A8 and Cortex-A9 Performance Events handling code.
> + *
> + * Copied from ARMv6 code, with the low level code inspired
> + *  by the ARMv7 Oprofile code.
> + *
> + * Cortex-A8 has up to 4 configurable performance counters and
> + *  a single cycle counter.
> + * Cortex-A9 has up to 31 configurable performance counters and
> + *  a single cycle counter.
> + *
> + * All counters can be enabled/disabled and IRQ masked separately. The cycle
> + *  counter and all 4 performance counters together can be reset separately.
> + */
> +
> +#define ARMV7_PMU_CORTEX_A8_NAME		"ARMv7 Cortex-A8"
> +
> +#define ARMV7_PMU_CORTEX_A9_NAME		"ARMv7 Cortex-A9"
> +
> +/* Common ARMv7 event types */
> +enum armv7_perf_types {
> +	ARMV7_PERFCTR_PMNC_SW_INCR		= 0x00,
> +	ARMV7_PERFCTR_IFETCH_MISS		= 0x01,
> +	ARMV7_PERFCTR_ITLB_MISS			= 0x02,
> +	ARMV7_PERFCTR_DCACHE_REFILL		= 0x03,
> +	ARMV7_PERFCTR_DCACHE_ACCESS		= 0x04,
> +	ARMV7_PERFCTR_DTLB_REFILL		= 0x05,
> +	ARMV7_PERFCTR_DREAD			= 0x06,
> +	ARMV7_PERFCTR_DWRITE			= 0x07,
> +
> +	ARMV7_PERFCTR_EXC_TAKEN			= 0x09,
> +	ARMV7_PERFCTR_EXC_EXECUTED		= 0x0A,
> +	ARMV7_PERFCTR_CID_WRITE			= 0x0B,
> +	ARMV7_PERFCTR_PC_WRITE			= 0x0C,
> +	ARMV7_PERFCTR_PC_IMM_BRANCH		= 0x0D,
> +	ARMV7_PERFCTR_UNALIGNED_ACCESS		= 0x0F,
> +	ARMV7_PERFCTR_PC_BRANCH_MIS_PRED	= 0x10,
> +	ARMV7_PERFCTR_CLOCK_CYCLES		= 0x11,
> +
> +	ARMV7_PERFCTR_PC_BRANCH_MIS_USED	= 0x12,
> +
> +	ARMV7_PERFCTR_CPU_CYCLES		= 0xFF
> +};

For consistency, I think it's better to stick to either MISS or REFILL.
Events 0x01 and 0x03 are the same event, just for instructions and data
respectively.
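
e.g. if you settle on REFILL (the names are only suggestions, the
numbers stay the same):

	ARMV7_PERFCTR_ICACHE_REFILL	= 0x01,	/* was IFETCH_MISS */
	ARMV7_PERFCTR_ITLB_REFILL	= 0x02,	/* was ITLB_MISS   */
	ARMV7_PERFCTR_DCACHE_REFILL	= 0x03,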

<snip>

> +static const unsigned armv7_a8_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
> +					  [PERF_COUNT_HW_CACHE_OP_MAX]
> +					  [PERF_COUNT_HW_CACHE_RESULT_MAX] = {
> +	[C(L1D)] = {
> +		/*
> +		 * The performance counters don't differentiate between read
> +		 * and write accesses/misses so this isn't strictly correct,
> +		 * but it's the best we can do. Writes and reads get
> +		 * combined.
> +		 */
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)]	= ARMV7_PERFCTR_DCACHE_ACCESS,
> +			[C(RESULT_MISS)]	= ARMV7_PERFCTR_L1_DATA_MISS,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)]	= ARMV7_PERFCTR_DCACHE_ACCESS,
> +			[C(RESULT_MISS)]	= ARMV7_PERFCTR_L1_DATA_MISS,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)]	= CACHE_OP_UNSUPPORTED,
> +			[C(RESULT_MISS)]	= CACHE_OP_UNSUPPORTED,
> +		},
> +	},

You're using the A8-only event 0x49 [ARMV7_PERFCTR_L1_DATA_MISS]. This also
counts hash misses in the address translation during the cache lookup procedure,
even if the resulting access is a [slightly slower] hit. I think you're better off
using event 0x03. In fact, I'd try to use architecturally defined events whenever
you can because that allows for comparisons between v7 cores.
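
i.e. for the A8 L1D map:

	[C(OP_READ)] = {
		[C(RESULT_ACCESS)]	= ARMV7_PERFCTR_DCACHE_ACCESS,	/* 0x04 */
		[C(RESULT_MISS)]	= ARMV7_PERFCTR_DCACHE_REFILL,	/* 0x03 */
	},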

> +	[C(L1I)] = {
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)]	= ARMV7_PERFCTR_L1_INST,
> +			[C(RESULT_MISS)]	= ARMV7_PERFCTR_L1_INST_MISS,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)]	= ARMV7_PERFCTR_L1_INST,
> +			[C(RESULT_MISS)]	= ARMV7_PERFCTR_L1_INST_MISS,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)]	= CACHE_OP_UNSUPPORTED,
> +			[C(RESULT_MISS)]	= CACHE_OP_UNSUPPORTED,
> +		},
> +	},

Same thing here. I'd suggest using 0x01 instead of 0x4a.
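
i.e.:

	[C(RESULT_MISS)]	= ARMV7_PERFCTR_IFETCH_MISS,	/* 0x01 */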

<snip>

> +/*
> + * Available counters
> + */
> +#define ARMV7_CNT0 		0	/* First event counter */
> +#define ARMV7_CCNT 		31	/* Cycle counter */
> +
> +#define ARMV7_A8_CNTMAX		5	/* Cortex-A8: up to 4 counters + CCNT */
> +#define ARMV7_A9_CNTMAX		32	/* Cortex-A9: up to 31 counters + CCNT*/

Actually, the A9 has a maximum of 6 event counters plus the CCNT.
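
i.e.:

	#define ARMV7_A9_CNTMAX		7	/* Cortex-A9: 6 counters + CCNT */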

<snip>

> +#define CPUID_V6_MASK   	0x7F000
> +#define CPUID_V6_BITS   	0x7B000
> +
> +#define CPUID_CORTEX_A8_BITS	0xC080
> +#define CPUID_CORTEX_A8_MASK	0xFFF0
> +
> +#define CPUID_CORTEX_A9_BITS	0xC090
> +#define CPUID_CORTEX_A9_MASK	0xFFF0

Just define CPUID_V7_MASK.
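
Something like:

	#define CPUID_V7_MASK		0xFFF0

	#define CPUID_CORTEX_A8_BITS	0xC080
	#define CPUID_CORTEX_A9_BITS	0xC090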

> +	unsigned long cpuid = read_cpuid_id();
> +
> +	/*
> +	 * ARMv6 detection
> +	 */
> +	if (CPUID_V6_BITS == (cpuid & CPUID_V6_MASK)) {
> +		armpmu = &armv6pmu;
> +		memcpy(armpmu_perf_cache_map, armv6_perf_cache_map,
> +			sizeof(armv6_perf_cache_map));
> +		perf_max_events	= armv6pmu.num_events;
> +	}
> +	/*
> +	 * ARMv7 detection
> +	 */
> +	else if (cpu_architecture() == CPU_ARCH_ARMv7) {
> +		/*
> +		 * Cortex-A8 detection
> +		 */
> +		if ((cpuid & CPUID_CORTEX_A8_MASK) == CPUID_CORTEX_A8_BITS) {
> +			armv7pmu.name = ARMV7_PMU_CORTEX_A8_NAME;
> +			memcpy(armpmu_perf_cache_map, armv7_a8_perf_cache_map,
> +				sizeof(armv7_a8_perf_cache_map));
> +			armv7pmu.event_map = armv7_a8_pmu_event_map;
> +			armpmu = &armv7pmu;
> +		} else
> +		/*
> +		 * Cortex-A9 detection
> +		 */
> +			if ((cpuid & CPUID_CORTEX_A9_MASK)
> +			    == CPUID_CORTEX_A9_BITS) {
> +				armv7pmu.name = ARMV7_PMU_CORTEX_A9_NAME;
> +				memcpy(armpmu_perf_cache_map,
> +					armv7_a9_perf_cache_map,
> +					sizeof(armv7_a9_perf_cache_map));
> +				armv7pmu.event_map = armv7_a9_pmu_event_map;
> +				armpmu = &armv7pmu;
> +		} else
> +			perf_max_events = -1;

I think the A9 code is indented a level too deep. It might also be
worth adding a cpu_architecture() check to the v6 test, just in case
a v7 core's ID conflicts with the v6 mask.
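
Roughly what I'm suggesting (untested, and assuming the CPUID_V7_MASK
definition from above):

	if (cpu_architecture() == CPU_ARCH_ARMv6 &&
	    (cpuid & CPUID_V6_MASK) == CPUID_V6_BITS) {
		/* ARMv6 setup as before */
	} else if (cpu_architecture() == CPU_ARCH_ARMv7) {
		if ((cpuid & CPUID_V7_MASK) == CPUID_CORTEX_A8_BITS) {
			/* Cortex-A8 setup */
		} else if ((cpuid & CPUID_V7_MASK) == CPUID_CORTEX_A9_BITS) {
			/* Cortex-A9 setup */
		} else {
			perf_max_events = -1;
		}
	}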

I hope that helps. Please let me know what you think about the event
numberings.

Cheers,

Will




