[RESEND v4] perf stat: Support per-cluster aggregation

Yicong Yang yangyicong at huawei.com
Tue Feb 6 23:25:44 PST 2024


On 2024/2/7 1:49, Ian Rogers wrote:
> On Tue, Feb 6, 2024 at 12:24 AM Yicong Yang <yangyicong at huawei.com> wrote:
>>
>> From: Yicong Yang <yangyicong at hisilicon.com>
>>
>> Some platforms have a 'cluster' topology, and the CPUs in a cluster share
>> resources such as the L3 cache tag (on HiSilicon Kunpeng SoC) or the L2
>> cache (on Intel Jacobsville). Parsing and building of the cluster topology
>> has been supported since [1].
>>
>> perf stat already supports aggregation for other topology levels such as
>> die and socket. It is useful to aggregate per cluster to find problems
>> such as L3 cache tag (L3T) bandwidth contention.
>>
>> This patch adds support for a "--per-cluster" option for per-cluster
>> aggregation, and updates the docs and the related tests. The output looks
>> like:
>>
>> [root at localhost tmp]# perf stat -a -e LLC-load --per-cluster -- sleep 5
>>
>>  Performance counter stats for 'system wide':
>>
>> S56-D0-CLS158    4      1,321,521,570      LLC-load
>> S56-D0-CLS594    4        794,211,453      LLC-load
>> S56-D0-CLS1030    4             41,623      LLC-load
>> S56-D0-CLS1466    4             41,646      LLC-load
>> S56-D0-CLS1902    4             16,863      LLC-load
>> S56-D0-CLS2338    4             15,721      LLC-load
>> S56-D0-CLS2774    4             22,671      LLC-load
>> [...]
>>
>> On a legacy system without clusters or without cluster support, the output
>> looks like:
>> [root at localhost perf]# perf stat -a -e cycles --per-cluster -- sleep 1
>>
>>  Performance counter stats for 'system wide':
>>
>> S56-D0-CLS0   64         18,011,485      cycles
>> S7182-D0-CLS0   64         16,548,835      cycles
>>
>> Note that this patch doesn't mix the cluster information into the output
>> of --per-core, to avoid breaking any tools/scripts that use it.
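
For anyone scripting against this output: each row starts with an aggregate
ID of the form S<socket>-D<die>-CLS<cluster>, followed by the number of CPUs
aggregated into that row and the counter value. A minimal sketch of how that
prefix is plausibly formatted in stat-display.c (the exact function and
buffer handling in the patch may differ):

	case AGGR_CLUSTER:
		snprintf(buf, sizeof(buf), "S%d-D%d-CLS%d",
			 id.socket, id.die, id.cluster);
		break;
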
>>
>> Note that perf recently added "--per-cache" aggregation, but that is not
>> the same as per-cluster aggregation, even though the CPUs in a cluster may
>> share some cache resources. For example, on my machine all clusters within
>> a die share the same L3 cache:
>> $ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
>> 0-31
>> $ cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
>> 0-3
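
As a standalone illustration (not part of the patch), a small C program that
dumps the per-CPU cluster membership using only the sysfs file shown above;
the assumption that CPUs are numbered contiguously from 0 is mine:

	#include <stdio.h>

	int main(void)
	{
		char path[128], line[256];
		FILE *fp;
		int cpu;

		/* Walk cpu0, cpu1, ... until the sysfs node is missing. */
		for (cpu = 0; ; cpu++) {
			snprintf(path, sizeof(path),
				 "/sys/devices/system/cpu/cpu%d/topology/cluster_cpus_list",
				 cpu);
			fp = fopen(path, "r");
			if (!fp)
				break;
			/* The file contents already end with a newline. */
			if (fgets(line, sizeof(line), fp))
				printf("cpu%d: cluster CPUs %s", cpu, line);
			fclose(fp);
		}
		return 0;
	}
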
>>
>> [1] commit c5e22feffdd7 ("topology: Represent clusters of CPUs within a die")
>> Tested-by: Jie Zhan <zhanjie9 at hisilicon.com>
>> Reviewed-by: Tim Chen <tim.c.chen at linux.intel.com>
>> Signed-off-by: Yicong Yang <yangyicong at hisilicon.com>
> 
> Reviewed-by: Ian Rogers <irogers at google.com>
> 

Thanks!

>> ---
>> Changes since v3:
>> - Rebase on v6.7-rc4 and resolve the conflicts
>> Link: https://lore.kernel.org/all/20230404104951.27537-1-yangyicong@huawei.com/
>>
>> Changes since v2:
>> - Use 0 as the cluster ID on legacy systems without cluster support, keeping
>>   consistent with what --per-die does.
>> Link: https://lore.kernel.org/all/20230328112717.19573-1-yangyicong@huawei.com/
>>
>> Changes since v1:
>> - Provide information in the manual about how to map clusters to CPUs
>> - Thanks to Tim for the review and to Jie for the testing.
>> Link: https://lore.kernel.org/all/20230313085911.61359-1-yangyicong@huawei.com/
>>
>>  tools/perf/Documentation/perf-stat.txt        | 11 ++++
>>  tools/perf/builtin-stat.c                     | 52 +++++++++++++++++--
>>  .../tests/shell/lib/perf_json_output_lint.py  |  4 +-
>>  tools/perf/tests/shell/lib/stat_output.sh     | 12 +++++
>>  tools/perf/tests/shell/stat+csv_output.sh     |  2 +
>>  tools/perf/tests/shell/stat+json_output.sh    | 13 +++++
>>  tools/perf/tests/shell/stat+std_output.sh     |  2 +
>>  tools/perf/util/cpumap.c                      | 32 +++++++++++-
>>  tools/perf/util/cpumap.h                      | 19 +++++--
>>  tools/perf/util/env.h                         |  1 +
>>  tools/perf/util/stat-display.c                | 13 +++++
>>  tools/perf/util/stat.h                        |  1 +
>>  12 files changed, 153 insertions(+), 9 deletions(-)
>>
[...]
>> diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
>> index 0581ee0fa5f2..5907456d42a2 100644
>> --- a/tools/perf/util/cpumap.c
>> +++ b/tools/perf/util/cpumap.c
>> @@ -222,6 +222,8 @@ static int aggr_cpu_id__cmp(const void *a_pointer, const void *b_pointer)
>>                 return a->socket - b->socket;
>>         else if (a->die != b->die)
>>                 return a->die - b->die;
>> +       else if (a->cluster != b->cluster)
>> +               return a->cluster - b->cluster;
>>         else if (a->cache_lvl != b->cache_lvl)
>>                 return a->cache_lvl - b->cache_lvl;
>>         else if (a->cache != b->cache)
>> @@ -309,6 +311,29 @@ struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
>>         return id;
>>  }
>>
>> +int cpu__get_cluster_id(struct perf_cpu cpu)
>> +{
>> +       int value, ret = cpu__get_topology_int(cpu.cpu, "cluster_id", &value);
> 
> nit: normal coding style is for a whitespace newline here.
> 

Will fix. I may have followed cpu__get_core_id() below, which is an exception.
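
Something like the below in the respin, with the blank line added after the
declaration (a sketch only; it assumes the body mirrors cpu__get_die_id(),
as in this patch):

	int cpu__get_cluster_id(struct perf_cpu cpu)
	{
		int value, ret = cpu__get_topology_int(cpu.cpu, "cluster_id", &value);

		return ret ?: value;
	}

The companion builder (not shown in the quoted hunks, likely named
aggr_cpu_id__cluster() to match aggr_cpu_id__die() above) is presumably where
the legacy fallback to cluster 0 from the v2 change note happens.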

Thanks,
Yicong


