[PATCH] coresight: etm4x: Handle unreachable sink in perf mode

Tue Sep 1 13:18:14 EDT 2020

Good morning,

On Tue, Sep 01, 2020 at 11:28:55AM +0100, Suzuki K Poulose wrote:
> On 08/19/2020 08:22 PM, Mathieu Poirier wrote:
> > Hi Suzuki,
> > 
> > On Tue, Aug 18, 2020 at 08:29:31PM +0100, Suzuki K Poulose wrote:
> > > If the specified/hinted sink is not reachable from a subset of the CPUs,
> > > we could end up unable to trace the event on those CPUs. This
> > > is the best effort we could do until we support 1:1 configurations.
> > > Fail gracefully in such cases avoiding a WARN_ON, which can be easily
> > > triggered by the user on certain platforms, like :
> > > 
> > > [10919.513250] ------------[ cut here ]------------
> > > [10919.517861] WARNING: CPU: 2 PID: 24021 at
> > > drivers/hwtracing/coresight/coresight-etm-perf.c:316 etm_event_start+0xf8/0x100
> > > ...
> > > 
> > > [10919.564403] CPU: 2 PID: 24021 Comm: perf Not tainted 5.8.0+ #24
> > > [10919.570308] pstate: 80400089 (Nzcv daIf +PAN -UAO BTYPE=--)
> > > [10919.575865] pc : etm_event_start+0xf8/0x100
> > > [10919.580034] lr : etm_event_start+0x80/0x100
> > > [10919.584202] sp : fffffe001932f940
> > > [10919.587502] x29: fffffe001932f940 x28: fffffc834995f800
> > > [10919.592799] x27: 0000000000000000 x26: fffffe0011f3ced0
> > > [10919.598095] x25: fffffc837fce244c x24: fffffc837fce2448
> > > [10919.603391] x23: 0000000000000002 x22: fffffc8353529c00
> > > [10919.608688] x21: fffffc835bb31000 x20: 0000000000000000
> > > [10919.613984] x19: fffffc837fcdcc70 x18: 0000000000000000
> > > [10919.619281] x17: 0000000000000000 x16: 0000000000000000
> > > [10919.624577] x15: 0000000000000000 x14: 00000000000009f8
> > > [10919.629874] x13: 00000000000009f8 x12: 0000000000000018
> > > [10919.635170] x11: 0000000000000000 x10: 0000000000000000
> > > [10919.640467] x9 : fffffe00108cd168 x8 : 0000000000000000
> > > [10919.645763] x7 : 0000000000000020 x6 : 0000000000000001
> > > [10919.651059] x5 : 0000000000000002 x4 : 0000000000000001
> > > [10919.656356] x3 : 0000000000000000 x2 : 0000000000000000
> > > [10919.661652] x1 : fffffe836eb40000 x0 : 0000000000000000
> > > [10919.666949] Call trace:
> > > [10919.669382]  etm_event_start+0xf8/0x100
> > > [10919.673203]  etm_event_add+0x40/0x60
> > > [10919.676765]  event_sched_in.isra.134+0xcc/0x210
> > > [10919.681281]  merge_sched_in+0xb0/0x2a8
> > > [10919.685017]  visit_groups_merge.constprop.140+0x15c/0x4b8
> > > [10919.690400]  ctx_sched_in+0x15c/0x170
> > > [10919.694048]  perf_event_sched_in+0x6c/0xa0
> > > [10919.698130]  ctx_resched+0x60/0xa0
> > > [10919.701517]  perf_event_exec+0x288/0x2f0
> > > [10919.705425]  begin_new_exec+0x4c8/0xf58
> > > [10919.709247]  load_elf_binary+0x66c/0xf30
> > > [10919.713155]  exec_binprm+0x15c/0x450
> > > [10919.716716]  __do_execve_file+0x508/0x748
> > > [10919.720711]  __arm64_sys_execve+0x40/0x50
> > > [10919.724707]  do_el0_svc+0xf4/0x1b8
> > > [10919.728095]  el0_sync_handler+0xf8/0x124
> > > [10919.732003]  el0_sync+0x140/0x180
> > > 
> > > Fixes: f9d81a657bb8 ("coresight: perf: Allow tracing on hotplugged CPUs")
> > > Reported-by: Jeremy Linton <jeremy.linton at arm.com>
> > > Cc: Mathieu Poirier <mathieu.poirier at linaro.org>
> > > Cc: Mike Leach <mike.leach at linaro.org>
> > > Signed-off-by: Suzuki K Poulose <suzuki.poulose at arm.com>
> > > ---
> > >   drivers/hwtracing/coresight/coresight-etm-perf.c | 10 ++++++++++
> > >   1 file changed, 10 insertions(+)
> > > 
> > > diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
> > > index 1a3169e69bb1..9d61a71da96f 100644
> > > --- a/drivers/hwtracing/coresight/coresight-etm-perf.c
> > > +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
> > > @@ -321,6 +321,16 @@ static void etm_event_start(struct perf_event *event, int flags)
> > >   	if (!event_data)
> > >   		goto fail;
> > > +	/*
> > > +	 * Check if this ETM is allowed to trace, as decided
> > > +	 * at etm_setup_aux(). This could be due to an unreachable
> > > +	 * sink from this ETM. We can't do much in this case if
> > > +	 * the sink was specified or hinted to the driver. For
> > > +	 * now, simply don't record anything on this ETM.
> > > +	 */
> > 
> > Can you provide more details on the scenario and the topology of the system?
> > Without either it is hard to wrap my head around the problem to address.
> > Having that information in the changelog would go a long way.
> 
> Sure. This was detected on N1SDP with the following topology, with the
> command :
> 
> 
> $ perf record -e cs_etm/@tmc_etf0/ --per-thread dd if=/dev/zero of=BIGFILE
> bs=1M count=100
> 
> 
> CPU0
>       \
>         Funnel0 ---- ETF0 --
>       /                      \
> CPU1
>                                Funnel2
> CPU2
>       \                      /
>         Funnel1 ---- ETF1 --
>       /
> CPU3
> 
> 
> Basically, a pair of CPUS (0&1, 2&3 respectively) are connected to a static
> funnel followed by a TMC-ETF, before connecting to a main
> funnel which merges the ETMs and the STM on the system.
> 
> In such a case, if the user selects tmc_etf0 as the sink for a perf
> session this could trigger a warning when starting the event on ETM2
> as we haven't been able to create a path. Also the CPU2 is cleared in
> the event_data->cpumask.

Ok, that's the kind of topology I imagined you were dealing with.

> 
> I will add the above to the commit log.

Yes please.

> 
> For now we don't really support multiple sinks for a perf session. This
> will need to be addressed for the per-CPU buffer scenario. But, we
> should fix the current logic until we get there, to avoid triggering
> the warnings, which can be done quite easily on these systems, which
> are not really per-CPU buffers.

I agree that it should fail gracefully.  

> 
> > 
> > I'm sure this is a per-thread scenario because there is more than one CPU per
> 
> Yes, it is a per-thread scenario.
> 
> > event.  I'm also suspecting this is on a system where there is one sink per CPU
> > cluster, and that is not supported.
> 
> No, that is not exactly the case, from the topology above. But not N:1
> either. More of N:M and this is possible on systems with per cluster ETFs.
> 
> > 
> > If I am right on both account I am questionning the "Fixes".  On a system with
> > N:1 topology the code introduced by f9d81a657bb8 will work in the event a CPU is
> > hotplugged in.  The code introduced in this patch is simply to prevent a
> 
> Correct. But, without the above commit, we would have failed while
> creating a path to the sink, because if a CPU was offline then we don't
> care about a path from that ETM. So this warning is essentially
> triggered only after the above commit and thus the Fixes tag.

That is an equally valid argument.

> 
> > warn_on() trace from being generated on systems that aren't supported.  It should
> > have a "stable" tag.
> 
> Cheers
> Suzuki