[PATCH v4 0/5] A mechanism for efficient support for per-function metrics
Mark Barnett
mark.barnett at arm.com
Fri Apr 11 04:07:47 PDT 2025
On 4/9/25 12:38, Ingo Molnar wrote:
>
> * mark.barnett at arm.com <mark.barnett at arm.com> wrote:
>
>> From: Mark Barnett <mark.barnett at arm.com>
>>
>> This patch introduces the concept of an alternating sample rate to perf
>> core and provides the necessary basic changes in the tools to activate
>> that option.
>>
>> The primary use case for this change is to be able to enable collecting
>> per-function performance metrics using the Arm PMU, as per the following
>> approach:
>>
>> * Starting with a simple periodic sampling (hotspot) profile,
>> augment each sample with PMU counters accumulated over a short window
>> up to the point the sample was taken.
>> * For each sample, perform some filtering to improve attribution of
>> the accumulated PMU counters (ensure they are attributed to a single
>> function)
>> * For each function accumulate a total for each PMU counter so that
>> metrics may be derived.
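[Editor's note: the per-function accumulation step in the last bullet might be sketched as below. The fixed-size table, counter count, and all names are purely illustrative; a real tool would key on symbols resolved from the callchain rather than raw addresses.]

```c
#include <stdint.h>

#define MAX_FUNCS   1024    /* illustrative table size */
#define NR_COUNTERS 4       /* illustrative number of PMU counters */

struct func_totals {
	uint64_t addr;                /* function start address (hypothetical key) */
	uint64_t counts[NR_COUNTERS]; /* accumulated PMU counter values */
	uint64_t nr_samples;          /* short windows attributed to this function */
};

static struct func_totals table[MAX_FUNCS];

/* Attribute one short-window sample's counters to its function. */
static void accumulate(uint64_t func_addr, const uint64_t *counters)
{
	for (int i = 0; i < MAX_FUNCS; i++) {
		if (table[i].addr == func_addr || table[i].nr_samples == 0) {
			table[i].addr = func_addr;
			for (int c = 0; c < NR_COUNTERS; c++)
				table[i].counts[c] += counters[c];
			table[i].nr_samples++;
			return;
		}
	}
}
```

Per-function metrics (e.g. a ratio of two counters) would then be derived from each entry's accumulated totals.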
>>
>> Without modification, sampling at a typical rate associated
>> with hotspot profiling (~1 ms) leads to poor results. Such an
>> approach gives a reasonable estimate of where the profiled
>> application is spending time for relatively low overhead, but the
>> PMU counters cannot easily be attributed to a single function as the
>> window over which they are collected is too large. A modern CPU may
>> execute many millions of instructions across many thousands of
>> functions within a 1 ms window. With this approach, the per-function
>> metrics tend to trend towards some average value across the top N
>> functions in the profile.
>>
>> In order to ensure a reasonable likelihood that the counters are
>> attributed to a single function, the sampling window must be rather
>> short; typically something on the order of a few hundred cycles works
>> well, as tested on a range of aarch64 Cortex and Neoverse cores.
>>
>> As it stands, it is possible to achieve this with perf using a very high
>> sampling rate (e.g. ~300 cycles), but there are at least three major
>> concerns with this approach:
>> with this approach:
>>
>> * For speculatively executing, out-of-order cores, can the results be
>> accurately attributed to a given function or the given sample window?
>> * A short sample window is not guaranteed to cover a single function.
>> * The overhead of sampling every few hundred cycles is very high and
>> is highly likely to cause throttling which is undesirable as it leads
>> to patchy results; i.e. the profile alternates between periods of
>> high frequency samples followed by longer periods of no samples.
>>
>> This patch does not address the first two points directly. Some means
>> to address those are discussed on the RFC v2 cover letter. The patch
>> focuses on addressing the final point, though happily this approach
>> gives us a way to perform basic filtering on the second point.
>>
>> The alternating sample period allows us to do two things:
>>
>> * We can control the risk of throttling and reduce overhead by
>> alternating between a long and short period. This allows us to
>> decouple the "periodic" sampling rate (as might be used for hotspot
>> profiling) from the short sampling window needed for collecting
>> the PMU counters.
>> * The sample taken at the end of the long period can be otherwise
>> discarded (as the PMU data is not useful), but the
>> PERF_RECORD_CALLCHAIN information can be used to identify the current
>> function at the start of the short sample window. This is useful
>> for filtering samples where the PMU counter data cannot be attributed
>> to a single function.
>>
>> There are several reasons why it is desirable to reduce the overhead and
>> risk of throttling:
>>
>> * PMU counter overflow typically causes an interrupt into the kernel;
>> this affects program runtime, and can affect things like branch
>> prediction, cache locality and so on which can skew the metrics.
>> * The very high sample rate produces significant amounts of data.
>> Depending on the configuration of the profiling session and machine,
>> it is easily possible to produce many orders of magnitude more data,
>> which is costly for tools to post-process and increases the chance
>> of data loss. This is especially relevant on larger core count
>> systems where it is very easy to produce massive recordings.
>> Whilst the kernel will throttle such a configuration,
>> which helps to mitigate a large portion of the bandwidth and capture
>> overhead, it is not something that can be controlled for on a per
>> event basis, or for non-root users, and because throttling is
>> controlled as a percentage of time, its effects vary from machine to
>> machine. AIUI throttling may also produce an uneven temporal
>> distribution of samples. Finally, whilst throttling does a good job
>> at reducing the overall amount of data produced, it still leads to
>> much larger captures than with this method; typically we have
>> observed 1-2 orders of magnitude larger captures.
>>
>> This patch set modifies perf core to support alternating between two
>> sample_period values, providing a simple and inexpensive way for tools
>> to separate out the sample window (time over which events are
>> counted) from the sample period (time between interesting samples).
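[Editor's note: the alternation described above can be sketched as a small state machine driven from the overflow path. Names and layout here are illustrative, not the actual kernel implementation.]

```c
#include <stdint.h>

/* Hypothetical bookkeeping for the two alternating periods. */
struct alt_period_state {
	uint64_t period_long;   /* e.g. ~1 ms worth of cycles */
	uint64_t period_short;  /* e.g. ~300 cycles */
	int in_short;           /* nonzero while the short window is armed */
};

/* Called on counter overflow: return the period to program next and
 * flip the phase.  The sample at the end of the long period supplies
 * the callchain; the one at the end of the short period supplies the
 * PMU counts for the short window. */
static uint64_t next_sample_period(struct alt_period_state *s)
{
	uint64_t p = s->in_short ? s->period_long : s->period_short;
	s->in_short = !s->in_short;
	return p;
}
```

With period_long covering ~1 ms and period_short ~300 cycles, each "interesting" sample's PMU counts cover only the short window, while the overall interrupt rate stays close to that of plain hotspot profiling.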
>
> Upstreaming path:
> =================
>
> So, while this looks interesting and it might work, a big problem as I
> see it is to get tools to use it: the usual kernel feature catch-22.
>
> So I think a hard precondition for an upstream merge would be for the
> usage of this new ABI to be built into 'perf top/record' and used by
> default, so the kernel side code gets tested and verified - and our
> default profiling output would improve rather substantially as well.
>
> ABI details:
> ============
>
> I'd propose a couple of common-sense extensions to the ABI:
>
> 1)
>
> I think a better approach would be to also batch the short periods,
> i.e. instead of interleaved long-short periods:
>
> L S L S L
>
> we'd support batches of short periods:
>
> L SSSS L SSSS L SSSS L SSSS
>
> As long as the long periods are 'long enough', throttling wouldn't
> (or, at least, shouldn't) trigger. (If throttling triggers, it's the
> throttling code that needs to be fixed.)
>
> This means that your proposed ABI would also require an additional
> parameter: [long,short,batch-count]. Your current proposal is basically
> [long,short,1].
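[Editor's note: under the proposed [long, short, batch-count] scheme, period selection might look like the sketch below; all names are hypothetical.]

```c
#include <stdint.h>

/* After each long period, run batch_count short periods before
 * returning to the long one (L SSSS L SSSS ...). */
struct burst_state {
	uint64_t period_long;
	uint64_t period_short;
	uint32_t batch_count;   /* short periods per burst */
	uint32_t shorts_left;   /* shorts remaining in the current burst */
};

static uint64_t burst_next_period(struct burst_state *s)
{
	if (s->shorts_left > 0) {
		s->shorts_left--;
		return s->period_short;
	}
	/* Burst exhausted: rearm it and take one long period. */
	s->shorts_left = s->batch_count;
	return s->period_long;
}
```

With batch_count == 1 this reduces to the strict L S L S alternation of the original proposal, matching the [long,short,1] reading above.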
>
> Advantages of batching the short periods (let's coin it
> 'burst-profiling'?) would be:
>
> - Performance: the caching of the profiling machinery, which would
> reduce the per-short-sample overhead rather substantially I believe.
> With your current approach we bring all that code into CPU caches
> and use it 1-2 times for a single data record, which is kind of a
> waste.
>
> - Data quality: batching increases the effective data rate of
> 'relevant' short samples, with very little overall performance
> impact. By tuning the long-period and the batch length the overall
> tradeoff between profiling overhead and amount of data extracted can
> be finetuned pretty well IMHO. (Tools might even opt to discard the
> first 'short' sample to decouple it from the first cache-cold
> execution of the perf machinery.)
>
> 2)
>
> I agree with the random-jitter approach as well, to remove short-period
> sampling artifacts that may arise out of the period length resonating
> with the execution time of key code sequences, especially in the 2-3
> digits long integers sampling period spectrum, but maybe it should be
> expressed in terms of a generic period length, not as a random 4-bit
> parameter overlaid on another parameter.
>
> Ie. the ABI would be something like:
>
> [period_long, period_short, period_jitter, batch_count]
>
> I see no reason why the random jitter has to be necessarily 4 bits
> short, and it could apply to the 'long' periods as well. Obviously this
> all complicates the math on the tooling side a bit. ;-)
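[Editor's note: one way to read the generic jitter term is as a random offset drawn from the jitter range and applied symmetrically around either period, so the mean period is preserved. A sketch, with 'rnd' standing in for a kernel PRNG draw and all names assumed:]

```c
#include <stdint.h>

/* Apply a symmetric random offset in [-jitter/2, +jitter/2) to a
 * period, clamping so we never arm a zero or negative period. */
static uint64_t apply_jitter(uint64_t period, uint64_t period_jitter,
			     uint64_t rnd)
{
	if (period_jitter == 0)
		return period;
	int64_t off = (int64_t)(rnd % period_jitter)
		      - (int64_t)(period_jitter / 2);
	if (off < 0 && (uint64_t)(-off) >= period)
		return 1;	/* clamp: keep the period positive */
	return period + off;
}
```

Because the offset is centred on zero, the same helper can be applied to period_long and period_short alike, per the proposed four-field ABI.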
>
> If data size is a concern: there's no real need to save space all that
> much on the perf_attr ABI side: it's a setup/configuration structure,
> not a per sample field where every bit counts.
>
> Implementation:
> ===============
>
> Instead of making it an entirely different mode, we could allow
> period_long to be zero, and map regular periodic events to
> [0,period_short,0,1], or so? But only if that simplifies/unifies the
> code.
>
> Summary:
> ========
>
> Anyway, would something like this work for you? I think the most
> important aspect is to demonstrate working tooling side. Good thing
> we have tools/perf/ in-tree for exactly such purposes. ;-)
>
> Thanks,
>
> Ingo
Thanks, Ingo, for the detailed notes. Your feedback is very much
appreciated.
Tool Integration
================
We've been using a Python script to process the data into a report. We
can look at implementing this directly in perf report, if that is
required. However, I'm nervous about making the new feature the default
behaviour for the tool.
This feature has been integrated into our tools [1] for the last 12
months, and has received a lot of testing on Arm Neoverse hardware.
Other platforms have received less rigorous testing. In my opinion, more
work would be needed to validate the PMU hardware & software
characteristics of other architectures before this can be made the default.
Burst Sampling
==============
I like the burst sampling idea. Increased I-Cache pressure is an
inherent weakness of this sampling method, and this would help to
alleviate that somewhat. I'll add this in the next spin.
Period Jitter
=============
Yes, we can apply this to both periods. I will make that change.
I'm not sure I've fully understood your suggestion here. In its current
state, the 4-bit jitter field acts as a base-2 exponent. This gives us a
random jitter value of up to 2**15. Is the suggestion to change this to
a fixed, absolute value that can be applied to both long & short periods?
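As a point of reference, the current encoding described above (a 4-bit field acting as a base-2 exponent) can be sketched as follows; the function name is illustrative:

```c
#include <stdint.h>

/* The 4-bit jitter field is a base-2 exponent, giving a maximum
 * random jitter magnitude of 2**0 .. 2**15. */
static uint64_t max_jitter_from_exponent(uint8_t jitter_exp4)
{
	jitter_exp4 &= 0xf;		/* only 4 bits are defined */
	return 1ull << jitter_exp4;	/* up to 2**15 */
}
```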
Thanks,
Mark
[1]
https://developer.arm.com/documentation/109847/9-3/Overview-of-Streamline-CLI-Tools