[PATCH 1/2] perf stat: Fix segfault when counting armv8_pmu events

Thu Sep 24 10:14:17 EDT 2020

Hi Andi,

On 2020/9/23 3:50, Andi Kleen wrote:
> On Tue, Sep 22, 2020 at 12:23:21PM -0700, Andi Kleen wrote:
>>> After debugging, i found the root reason is that the xyarray fd is created
>>> by evsel__open_per_thread() ignoring the cpu passed in
>>> create_perf_stat_counter(), while the evsel' cpumap is assigned as the
>>> corresponding PMU's cpumap in __add_event(). Thus, the xyarray fd is created
>>> with ncpus of dummy cpumap and an out of bounds 'cpu' index will be used in
>>> perf_evsel__close_fd_cpu().
>>>
>>> To address this, add a flag to mark this situation and avoid using the
>>> affinity technique when closing/enabling/disabling events.
>>
>> The flag seems like a hack. How about figuring out the correct number of 
>> CPUs and using that?
> 
> Also would like to understand what's different on ARM64 than other architectures.
> Or could this happen on x86 too?
> 

The problem is that when the user requests per-task events, the cpumask is
expected to be NULL (dummy), while armv8_pmu does have a cpumask, which is
inherited by the evsel. The armv8_pmu cpumask was added for heterogeneous
systems, so this issue cannot happen on x86.
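
To illustrate the mismatch, here is a minimal standalone sketch. It is not
the actual perf code: struct fd_table and its helpers are hypothetical
stand-ins for the xyarray of fds, and I assume the dummy cpumap has a single
entry. The table is sized from the dummy cpumap, but later indexed with a cpu
index taken from the PMU's cpumap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the xyarray of fds: ncpus x nthreads entries. */
struct fd_table {
	int ncpus;
	int nthreads;
	int *fd;
};

/* Allocate a table sized for the cpumap the events were opened with. */
static struct fd_table *fd_table_new(int ncpus, int nthreads)
{
	struct fd_table *t;

	assert(ncpus > 0 && nthreads > 0);

	t = malloc(sizeof(*t));
	if (!t)
		return NULL;

	t->fd = calloc((size_t)ncpus * (size_t)nthreads, sizeof(int));
	if (!t->fd) {
		free(t);
		return NULL;
	}
	t->ncpus = ncpus;
	t->nthreads = nthreads;
	return t;
}

static void fd_table_free(struct fd_table *t)
{
	if (t) {
		free(t->fd);
		free(t);
	}
}

/* True only if (cpu, thread) addresses an entry inside the table. */
static bool fd_table_valid(const struct fd_table *t, int cpu, int thread)
{
	return cpu >= 0 && cpu < t->ncpus &&
	       thread >= 0 && thread < t->nthreads;
}
```

With a table created as fd_table_new(1, 1) for a per-task event, any cpu
index above 0 coming from the PMU's cpumap fails fd_table_valid(), which
corresponds to the out-of-bounds access seen when closing the fds.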

In fact, the cpumask itself is correct, but it shouldn't be used when we are
requesting per-task events. As these events should be installed on all cores,
I doubt how much we can benefit from the affinity technique, so I chose to
add a flag.

I also ran a test on a HiSilicon arm64 D06 board with 2 sockets and 128 cores,
running the following command 3 times, with and without the affinity technique:

time tools/perf/perf stat -ddd -C 0-127 --per-core --timeout=5000 2> /dev/null

* (HEAD detached at 7074674e7338) perf cpumap: Maintain cpumaps ordered and without dups
real	0m8.039s
user	0m0.402s
sys	0m2.582s

real	0m7.939s
user	0m0.360s
sys	0m2.560s

real	0m7.997s
user	0m0.358s
sys	0m2.586s

* (HEAD detached at 704e2f5b700d) perf stat: Use affinity for enabling/disabling events
real	0m7.954s
user	0m0.308s
sys	0m2.590s

real	0m12.959s
user	0m0.332s
sys	0m2.582s

real	0m18.009s
user	0m0.346s
sys	0m2.562s
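
To compare the runs above, off-CPU time can be estimated as
real - (user + sys). This is only a rough approximation that I am assuming
here, not something perf reports, and struct time_sample and both helpers
are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* One result of time(1), all values in seconds. */
struct time_sample {
	double real;
	double user;
	double sys;
};

/* Rough off-CPU estimate: wall-clock time not accounted as CPU time. */
static double offcpu_seconds(const struct time_sample *s)
{
	return s->real - (s->user + s->sys);
}

/* Mean off-CPU estimate over n runs. */
static double mean_offcpu_seconds(const struct time_sample *s, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);

	for (i = 0; i < n; i++)
		sum += offcpu_seconds(&s[i]);

	return sum / (double)n;
}
```

Applied to the numbers above, the three runs at 7074674e7338 each come out at
about 5.0s, while the three runs at 704e2f5b700d come out at about 5.1s,
10.0s and 15.1s.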

The off-CPU time is much longer when using affinity; I think that is the cost
of migration. Could you please share your test case with me?

Thanks,
Wei