[PATCH v2] perf/core: Add support for PMUs that can be read from more than 1 CPU

Mon Mar 5 04:17:02 PST 2018

On Fri, Mar 02, 2018 at 05:14:53PM -0800, Saravana Kannan wrote:
> Some PMUs events can be read from more than the one CPU. So allow the
> PMU driver to mark events as such. For these events, we don't need to
> reject reads or make smp calls to the event's CPU (and cause
> unnecessary overhead and wake ups).
> 
> When a PMU driver marks an event as such, care must be taken by the
> driver to make sure they can handle the event being read/updated from
> more than 1 CPU at the same time (Eg: due to an IRQ indicating event
> counter overflow and another thread trying to read the latest values).
> 
> Good examples of such events would be events from caches shared across
> CPUs.
> 
> Signed-off-by: Saravana Kannan <skannan at codeaurora.org>
> ---
> Changes since v1:
> - Use cpumasks instead of capability flag as that's more flexible.
> 
>  include/linux/perf_event.h |  1 +
>  kernel/events/core.c       | 14 +++++++++-----
>  2 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 7546822..4cec431 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -629,6 +629,7 @@ struct perf_event {
>  
>  	int				oncpu;
>  	int				cpu;
> +	cpumask_t			readable_on_cpus;

For most PMUs, this will be empty, and it's potentially *very* large
(e.g. on systems where NR_CPUS is 4096). Please use a pointer to a mask,
as I suggested in [1], e.g.

	cpumask_t			*read_mask;

That way, PMUs which already maintain an affinity mask can share that
between all of their events.
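
To put a rough number on the size concern, here is a userspace sketch
(not kernel code). The sketch_* names and SKETCH_NR_CPUS are made up
for illustration; the mask layout only assumes one bit per possible
CPU packed into unsigned longs, which is how cpumask_t is laid out.

```c
#include <stddef.h>

/* Hypothetical stand-ins: model a build with NR_CPUS = 4096. */
#define SKETCH_NR_CPUS		4096
#define SKETCH_BITS_PER_LONG	(8 * sizeof(unsigned long))
#define SKETCH_MASK_LONGS \
	((SKETCH_NR_CPUS + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* One bit per possible CPU, packed into unsigned longs. */
typedef struct {
	unsigned long bits[SKETCH_MASK_LONGS];
} sketch_cpumask_t;

/* Variant from the patch: every event carries a full mask. */
struct sketch_event_embedded {
	int			oncpu;
	int			cpu;
	sketch_cpumask_t	readable_on_cpus;
};

/* Suggested variant: every event carries only a pointer. */
struct sketch_event_pointer {
	int			oncpu;
	int			cpu;
	const sketch_cpumask_t	*read_mask;	/* NULL for most PMUs */
};
```

With 4096 possible CPUs the embedded mask alone is 4096 / 8 bytes per
event, whereas the pointer costs one word and can be left NULL.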

PMUs with PERF_EV_CAP_READ_ACTIVE_PKG can be updated to flip that mask
in pmu::add() and pmu::del(). I assume there are existing sibling masks
we can use. That means we can remove PERF_EV_CAP_READ_ACTIVE_PKG
entirely...

>  	struct list_head		owner_entry;
>  	struct task_struct		*owner;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5d3df58..1a8fbfa 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3483,10 +3483,12 @@ struct perf_read_data {
>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  {
>  	u16 local_pkg, event_pkg;
> +	int local_cpu = smp_processor_id();
>  
> -	if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
> -		int local_cpu = smp_processor_id();
> +	if (cpumask_test_cpu(local_cpu, &event->readable_on_cpus))
> +		return local_cpu;
>  
> +	if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
>  		event_pkg = topology_physical_package_id(event_cpu);
>  		local_pkg = topology_physical_package_id(local_cpu);

... and this would simplify down to:

static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
{
	int local_cpu = smp_processor_id();

	if (event->read_mask && cpumask_test_cpu(local_cpu, event->read_mask))
		return local_cpu;

	return event_cpu;
}
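
For illustration only, the same selection logic as a userspace model.
Everything prefixed model_ is a hypothetical stand-in rather than a
kernel API: the mask is a single 64-bit word, and local_cpu is passed
in as a parameter in place of smp_processor_id().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef struct {
	uint64_t bits;
} model_cpumask_t;

/* Stand-in for struct perf_event; only the proposed field is modeled. */
struct model_event {
	const model_cpumask_t *read_mask;	/* NULL: no extra readable CPUs */
};

static bool model_cpumask_test_cpu(int cpu, const model_cpumask_t *mask)
{
	return cpu >= 0 && cpu < 64 && ((mask->bits >> cpu) & 1);
}

/*
 * Pick the CPU to read the event on: the local CPU if the PMU says the
 * event is readable there, otherwise the CPU the event lives on.
 */
static int model_read_cpu(const struct model_event *event, int event_cpu,
			  int local_cpu)
{
	if (event->read_mask &&
	    model_cpumask_test_cpu(local_cpu, event->read_mask))
		return local_cpu;

	return event_cpu;
}
```

An event with a NULL mask always resolves to event_cpu, so PMUs that
never set the pointer keep their current behaviour. Two events from
the same PMU can point at one shared mask.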

> @@ -3575,7 +3577,8 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
>  {
>  	unsigned long flags;
>  	int ret = 0;
> -
> +	int local_cpu = smp_processor_id();
> +	bool readable = cpumask_test_cpu(local_cpu, &event->readable_on_cpus);
>  	/*
>  	 * Disabling interrupts avoids all counter scheduling (context
>  	 * switches, timer based rotation and IPIs).
> @@ -3600,7 +3603,8 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
>  
>  	/* If this is a per-CPU event, it must be for this CPU */
>  	if (!(event->attach_state & PERF_ATTACH_TASK) &&
> -	    event->cpu != smp_processor_id()) {
> +	    event->cpu != local_cpu &&
> +	    !readable) {
>  		ret = -EINVAL;
>  		goto out;
>  	}
> @@ -3610,7 +3614,7 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
>  	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
>  	 * oncpu == -1).
>  	 */
> -	if (event->oncpu == smp_processor_id())
> +	if (event->oncpu == smp_processor_id() || readable)
>  		event->pmu->read(event);

Please explain why you need to change perf_event_read_local().

Is there a case where you have numbers to show that
perf_event_read_local() is a bottleneck? If so, please elaborate.

As-is, this doesn't seem right.

Thanks,
Mark.

[1] https://lkml.kernel.org/r/20171128124534.3jvuala525wvn64r@wfg-t540p.sh.intel.com