[PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores

Tue Mar 29 07:32:14 PDT 2022

Hi Leo,

On Mon, 28 Mar 2022 21:05:47 +0800, Leo Yan wrote:
> Hi Ali,
> 
> On Mon, Mar 28, 2022 at 03:08:05AM +0000, Ali Saidi wrote:
> 
> [...]
> 
> > > > > > +	case ARM_SPE_NV_PEER_CORE:
> > > > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > > > 
> > > > > Peer core contains its local L1 cache, so I think we can set the
> > > > > memory level L1 to indicate this case.
> > > > It could be either the L1 or the L2. All the neoverse cores have private L2
> > > > caches and we don't know. 
> > > 
> > > How about set both L1 and L2 cache together for this case?
> > > 
> > > Although 'L1 | L2' cannot tell the exact cache level, I think it's
> > > better than use ANY_CACHE, at least this can help us to distinguish
> > > from other data source types if we avoid to use ANY_CACHE for all of
> > > them.
> > 
> > This seems much more confusing then the ambiguity of where the line came from
> > and is only possible with the deprecated mem_lvl enconding.  It will make
> > perf_mem__lvl_scnprintf() print the wrong thing and anyone who is trying to
> > attribute a line to a single cache will find that the sum of the number of hits
> > is greater than the number of accesses which also seems terribly confusing.
> 
> Understand.  I considered the potential issue for
> perf_mem__lvl_scnprintf(), actually it supports printing multipl cache
> levels for 'mem_lvl', by conjuncting with 'or' it composes the multiple
> cache levels.  We might need to extend a bit for another field
> 'mem_lvlnum'.
> 
> Agreed that there would have inaccurate issue for statistics, it's fine
> for me to use ANY_CACHE in this patch set.

Thanks!

> 
> I still think we should consider to extend the memory levels to
> demonstrate clear momory hierarchy on Arm archs, I personally like the
> definitions for "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE",
> though these cache levels are not precise like L1/L2/L3 levels, they can
> help us to map very well for the cache topology on Arm archs and without
> any confusion.  We could take this as an enhancement if you don't want
> to bother the current patch set's upstreaming.

I'd like to do this in a separate patch, but I have one other proposal. The
Neoverse cores L2 is strictly inclusive of the L1, so even if it's in the L1,
it's also in the L2. Given that the Graviton systems and afaik the Ampere
systems don't have any cache between the L2 and the SLC, thus anything from
PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, perhaps we
should just set L2 for these cases? German, are you good with this for now? 

> > > > > For this data source type and below types, though they indicate
> > > > > the snooping happens, but it doesn't mean the data in the cache line
> > > > > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > > > > think this will mislead users when report the result.
> > > > 
> > > > I'm of the opposite opinion. If the data wasn't modified, it will likely be
> > > > found in the lower-level shared cache and the transaction wouldn't require a
> > > > cache-to-cache transfer of the modified data, so the most common case when we
> > > > source a line out of another cores cache will be if it was "modifiable" in that
> > > > cache. 
> > > 
> > > Let's still use MOSI protocol as example.  I think there have two
> > > cases: on case is the peer cache line is in 'Shared' state and another
> > > case is the peer cache line is in 'Owned' state.
> > > 
> > > Quotes the wiki page for these two cases:
> > > 
> > > "When the cache block is in the Shared (S) state and there is a
> > > snooped bus read (BusRd) transaction, then the block stays in the same
> > > state and generates no more transactions as all the cache blocks have
> > > the same value including the main memory and it is only being read,
> > > not written into."
> > > 
> > > "While in the Owner (O) state and there is a snooped read request
> > > (BusRd), the block remains in the same state while flushing (Flush)
> > > the data for the other processor to read from it."
> > > 
> > > Seems to me, it's reasonable to set HTIM flag when the snooping happens
> > > for the cache line line is in the Modified (M) state.
> > > 
> > > Again, my comment is based on the literal understanding; so please
> > > correct if have any misunderstanding at here.
> > 
> > Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
> > where sharing patterns are detected," so if a line is shared among caches it
> > will likely also be in the SLC (and thus we'd get an L3 hit). 
> > 
> > If there is one copy of the cache line and that cache line needs to transition
> > from one core to another core, I don't believe it matters if it was truly
> > modified or not because the core already had permission to modify it and the
> > transaction is just as costly (ping ponging between core caches). This is the
> > one thing I really want to be able to uniquely identify as any cacheline doing
> > this that isn't a lock is a place where optimization is likely possible. 
> 
> I understood that your point that if big amount of transitions within
> multiple cores hit the same cache line, it would be very likely caused
> by the cache line's Modified state so we set PERF_MEM_SNOOP_HITM flag
> for easier reviewing.

And that from a dataflow perspective if the line is owned (and could be
modifiable) vs. was actually modified is really the less interesting bit. 

> Alternatively, I think it's good to pick up the patch series "perf c2c:
> Sort cacheline with all loads" [1], rather than relying on HITM tag, the
> patch series extends a new option "-d all" for perf c2c, so it displays
> the suspecious false sharing cache lines based on load/store ops and
> thread infos.  The main reason for holding on th patch set is due to we
> cannot verify it with Arm SPE at that time point, as the time being Arm
> SPE trace data was absent both store ops and data source packets.

Looking at examples I don't, at least from my system, data-source isn't set for
stores, only for loads. 

> I perfer to set PERF_MEM_SNOOP_HIT flag in this patch set and we can
> upstream the patch series "perf c2c: Sort cacheline with all loads"
> (only needs upstreaming patches 01, 02, 03, 10, 11, the rest patches
> have been merged in the mainline kernel).
> 
> If this is fine for you, I can respin the patch series for "perf c2c".
> Or any other thoughts?

I think this is a nice option to have in the tool-box, but from my point of
view, I'd like someone who is familiar with c2c output on x86 to come to an
arm64 system and be able to zero in on a ping-ponging line like they would
otherwise. Highlighting a line that is moving between cores frequently which is
likely in the exclusive state by tagging it an HITM accomplishes this and will
make it easier to find these cases.  Your approach also has innaccurancies and
wouldn't be able to differentiate between core X accessing a line a lot followed
by core Y acessing a line alot vs the cores ping-ponging.  Yes, I agree that we
will "overcount" HITM, but I don't think this is particularly bad and it does
specifically highlight the core-2-core transfers that are likely a performance
issue easily and it will result in easier identification of areas of false or
true sharing and improve performance.

Thanks,
Ali