[PATCH] psci: add debugfs for runtime instrumentation
Lorenzo Pieralisi
lorenzo.pieralisi at arm.com
Mon Jun 19 04:25:08 PDT 2017
On Fri, Jun 16, 2017 at 04:26:20PM +0200, Vincent Guittot wrote:
> Hi Mark,
>
> On 16 June 2017 at 13:47, Mark Rutland <mark.rutland at arm.com> wrote:
> > Hi Vincent,
> >
> > On Fri, Jun 16, 2017 at 10:23:40AM +0200, Vincent Guittot wrote:
> >> Add debugfs to display runtime instrumentation results that can be collected
> >> by PSCI firmware.
> >
> > Could you elaborate on this please? e.g.
>
> yes sorry. I should have given more details about it
>
> >
> > * What is instrumented?
>
> The ARM Trusted Firwware that implements PSCI, also has a Performance
> Measurement Framework (PMF) that can be used to save some timetamps
> when running psci firmware. On top of PMF, serveral services have been
> implemented like the PSCI residency stats and the runtime
> instrumentation service. The latter saves timestamp when a PSCI
> service is entered and then leaved or when cache maintenance is
> executed durign psci call. With these timestamps, we can measure the
> duration of some part of code of the PSCI firmware.
>
> > * What values can be derived from this?
>
> The time needed to execute the firmware part of the PSCI and extract
> some min, average and max duration to run some PSCI services
>
>
> > * How is this exposed by the driver? (e.g. trace, summation, average?)
>
> The firmware only exposed some timestamps and this patch compute the
> result to extract some latency information
>
>
> > * What is the format of the files? (shouldn't there be documentation?)
> > * Who is this useful for, and when?
>
> The debugfs provides a snapshot of the timestamps with latency current entry
>
> And some statistics per CPU with latency_summary entry . I have put
> below an example:
> #cat /sys/kernel/debug/psci/latency_summary
> CPU0 step min avg max count
> enter 12502 12502 313396 18665
> leave 20837 20837 298393 18665
> cache 4167 4167 256718 18665
> CPU1 step min avg max count
> enter 13336 13336 425085 18381
> leave 21671 21671 308395 18381
> cache 4167 4167 370074 18381
> CPU2 step min avg max count
> enter 12502 12502 267553 18349
> leave 21671 21671 265053 18349
> cache 4167 4167 220877 18349
>
> Such informations are useful when working on cpuidle and power
> management and when studying the cpu entry/wakeup latency
>
> > * On which platforms is this available, and when?
>
> The code that save the timestamp is generic and as result available
> for all platforms that implement PSCI. Nevertheless, each platfrom has
> to enable the SMC service. At now, juno and hikey has enable the
> service and hikey960 should some soon as well
>
> >
> > Is this gathered on each call, periodically, or lazily, when requested
> > by the user?
>
> At boot, everything is disabled. User has to enable the gathering of
> the timestamp and the computation of the statistics.
> When enable, the statistics are updated on every call to psci_cpu_suspend_enter
>
> >
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot at linaro.org>
> >> ---
> >> drivers/firmware/Kconfig | 8 ++
> >> drivers/firmware/psci.c | 366 +++++++++++++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 374 insertions(+)
> >>
> >> diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
> >> index 6e4ed5a..4431504 100644
> >> --- a/drivers/firmware/Kconfig
> >> +++ b/drivers/firmware/Kconfig
> >> @@ -8,6 +8,14 @@ menu "Firmware Drivers"
> >> config ARM_PSCI_FW
> >> bool
> >>
> >> +config ARM_PSCI_INSTRUMENTATION
> >> + bool "ARM PSCI Run Time Instrumentation"
> >> + depends on ARM_PSCI_FW && CPU_IDLE && DEBUG_FS
> >> + help
> >> + Enable debugfs interface for PSCI FW in which you can retrieve
> >> + runtime instrumentation values of the PSCI firmware like time to
> >> + enter or leave a cpu suspend state or time to flush cache.
> >
> > Looking at the latest published spec (ARM DEN 0022D), PSCI doesn't
> > provide such information, as there's only PSCI_STAT_{COUNT,RESIDENCY}.
>
> This use the same Performance Measurement Framework as
> PSCI_STAT_{COUNT,RESIDENCY}
Except that as Mark pointed out PSCI_STAT_{COUNT, RESIDENCY} are
standard calls and part of the PSCI specification and the rest of the
PMF is implementation defined firmware that is mostly useful for
debugging...firmware :) so why do we even have to bother adding a kernel
debugs layer to read those stats ?
Anyway, I am happy to implement PSCI v1.0 calls to retrieve
PSCI_STAT_{COUNT, RESIDENCY} statistics to add to CPU_IDLE kernel
stats, as for this patch it does not implement anything that is part of
the PSCI specification so there is not much else I can comment on I am
afraid.
Thanks,
Lorenzo
> > What mechanisms are you using here?
>
> More details can be founded here
> https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/firmware-design.md#13--performance-measurement-framework
>
> >
> > As far as I can tell, these are not part of the PSCI standard, and I'm
> > not aware of any documentation.
> >
> > I'm very much not keen on using IMPLEMENTATION DEFINED functionality in
> > the PSCI driver, as IMP DEF stuff always creates long-term maintenance
> > problems. You'll need to convince me that this is worth supporting.
>
> I will try.
>
> I have put it in psci.c as it is really about measurement latenccy of
> psci firmware but it can probably be put somewhere else if it is more
> relevant. The only link between current psci driver and this feature
> is the call to psci_update_all_rt_instr() in psci_cpu_suspend_enter()
> to make sure that we have coherent timestamp
> Thinking a bit more about this, we can probably use the cpu_pm_exit
>
> >
> > [...]
> >
> >> +#ifdef CONFIG_ARM_PSCI_INSTRUMENTATION
> >> +#include <linux/clocksource.h>
> >> +#include <linux/debugfs.h>
> >> +#include <linux/seq_file.h>
> >> +
> >> +#define PMF_SMC_GET_TIMESTAMP_64 0xC2000010
> >
> > What is PMF?
>
> Performance Monitoring Framework
>
> >
> > Per the SMCCC, this ID is an atomic {SMC,HVC}64 SiP call.
> >
> > I'm really not keen on using IMP DEF SiP calls like this. These are not
> > part of PSCI.
> >
> >> +
> >> +#define PMF_ARM_TIF_IMPL_ID 0x41
> >> +
> >> +#define PMF_RT_INSTR_SVC_ID 1
> >> +
> >> +#define RT_INSTR_ENTER_PSCI 0
> >> +#define RT_INSTR_EXIT_PSCI 1
> >> +#define RT_INSTR_ENTER_HW_LOW_PWR 2
> >> +#define RT_INSTR_EXIT_HW_LOW_PWR 3
> >> +#define RT_INSTR_ENTER_CFLUSH 4
> >> +#define RT_INSTR_EXIT_CFLUSH 5
> >> +#define RT_INSTR_TOTAL_IDS 6
> >> +
> >> +#define PMF_TID_SHIFT 0
> >> +#define PMF_TID_MASK (0xFF << PMF_TID_SHIFT)
> >> +#define PMF_SVC_ID_SHIFT 10
> >> +#define PMF_SVC_ID_MASK (0x3F << PMF_SVC_ID_SHIFT)
> >> +#define PMF_IMPL_ID_SHIFT 24
> >> +#define PMF_IMPL_ID_MASK (0xFFU << PMF_IMPL_ID_SHIFT)
> >> +
> >> +struct rt_inst_stats {
> >> + unsigned long min;
> >> + unsigned long max;
> >> + unsigned long avg;
> >> + unsigned long count;
> >> +};
> >> +
> >> +static struct dentry *psci_debugfs_dir;
> >> +
> >> +/* meaningful instrumentation name */
> >> +char *rt_inst_name[RT_INSTR_TOTAL_IDS/2] = {
> >> + "enter",
> >> + "leave",
> >> + "cache"
> >> +};
> >> +
> >> +/* timstamp index to use to compute the delta */
> >> +int rt_inst_delta_idx[RT_INSTR_TOTAL_IDS/2][2] = {
> >> + /* Start --> End */
> >> + {RT_INSTR_ENTER_PSCI, RT_INSTR_ENTER_HW_LOW_PWR},
> >> + {RT_INSTR_EXIT_HW_LOW_PWR, RT_INSTR_EXIT_PSCI},
> >> + {RT_INSTR_ENTER_CFLUSH, RT_INSTR_EXIT_CFLUSH}
> >> +};
> >> +
> >> +DEFINE_PER_CPU(struct rt_inst_stats[RT_INSTR_TOTAL_IDS/2], rt_inst_summary);
> >> +int rt_track_enable;
> >> +
> >> +u32 rt_inst_shift;
> >> +u32 rt_inst_mult = 1;
> >> +
> >> +static bool psci_rt_instr_enable(void)
> >> +{
> >> + int ret, tid;
> >> + int cpu_mpidr = cpu_logical_map(0);
> >> + tid |= PMF_RT_INSTR_SVC_ID << PMF_SVC_ID_SHIFT | RT_INSTR_ENTER_PSCI;
> >
> > What exactly is this tid parameter?
>
> This tid selects which timestamp to get among the list
>
> >
> >> +
> >> + arm_smccc_smc(PMF_SMC_GET_TIMESTAMP_64, tid, cpu_mpidr,
> >> + 0, 0, 0, 0, 0, &res);
> >> +
> >> + ret = psci_to_linux_errno(res.a0);
> >
> > The above was not a PSCI call. Does it follow the same return
> > conventions?
>
> No, psci smc call function only returns a0. In this case of runtime
> instrumentation, a0 is the error code and a1 is the timestamp
>
> >
> > How big's the returnned timestamp?
>
> it's a 64bit value
>
> >
> > How's a big value disambiguated from a negative error code?
>
> a0 is error code and a1 is the timestamp. the function
> psci_rt_instr_enable() just try to make get a timestamp to check that
> the service is supported by the firmwareso in this case we don't care
> of a1 but we do in psci_get_one_rt_instr_cpu
>
> >
> >> +
> >> + if (ret)
> >> + return false;
> >> +
> >> + return true;
> >> +}
> >> +
> >> +static long long psci_get_one_rt_instr_cpu(int id, int cpu, int cache)
> >> +{
> >> + int ret, tid;
> >> + int cpu_mpidr = cpu_logical_map(cpu);
> >> + struct arm_smccc_res res;
> >> +
> >> + tid = PMF_ARM_TIF_IMPL_ID << PMF_IMPL_ID_SHIFT;
> >> + tid |= PMF_RT_INSTR_SVC_ID << PMF_SVC_ID_SHIFT | id;
> >> +
> >> + arm_smccc_smc(PMF_SMC_GET_TIMESTAMP_64, tid, cpu_mpidr, cache,
> >> + 0, 0, 0, 0, &res);
> >
> > What exaclty does this "cache" parameter do?
>
> When set, this parameter forces a dcache invalidate in the PSCI
> firmware before reading the timestamp
>
> >
> >> +
> >> + ret = psci_to_linux_errno(res.a0);
> >> +
> >> + if (ret)
> >> + return (long long)(ret);
> >
> > Surely the implicit conversion to the return type is sufficient?
>
> yes probably. I tend to explicitly cast in order to make the
> conversion easier to identify when reading the code
>
> >
> >> +
> >> + return res.a1;
> >> +}
> >> +
> >> +static int psci_get_all_rt_instr_cpu(int cpu, int cache, long long timestamp[])
> >> +{
> >> + int i;
> >> +
> >> + for (i = 0; i < RT_INSTR_TOTAL_IDS; i++) {
> >> + timestamp[i] = psci_get_one_rt_instr_cpu(i, cpu, cache);
> >> + if (timestamp[i] < 0)
> >> + return (int)(timestamp[i]);
> >> + }
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static void psci_update_all_rt_instr(int cpu, int cache)
> >> +{
> >> + long long tmp_rt_inst[RT_INSTR_TOTAL_IDS];
> >> + long tmp_delta[RT_INSTR_TOTAL_IDS/2];
> >> + struct rt_inst_stats *my_stats = per_cpu(rt_inst_summary, cpu);
> >> + long long start_ts;
> >> + int i;
> >> +
> >> + /* Get all RT statistics of the cpu*/
> >> + if (psci_get_all_rt_instr_cpu(cpu, cache, tmp_rt_inst))
> >> + /* An error happens while getting timestamp */
> >> + return;
> >> +
> >> + /* Get start time */
> >> + start_ts = tmp_rt_inst[0];
> >> +
> >> + /* Compute delta time */
> >> + for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
> >> + long long start = tmp_rt_inst[rt_inst_delta_idx[i][0]];
> >> + long long end = tmp_rt_inst[rt_inst_delta_idx[i][1]];
> >> +
> >> + /*
> >> + * If time is before start time, discard the value which come
> >> + * from a previous call
> >> + */
> >
> > I don't follow this comment.
>
> start_ts = tmp_rt_inst[0] is the first timestamp of a PSCI call so if
> another timestamp is before, this probably mean that it comes from
> another call like cache maintenance timestamp not updated for this
> psci call
>
> I will update the comment to make it clearer
>
> >
> >> + if ((start >= start_ts) && (end >= start))
> >> + tmp_delta[i] = end - start;
> >> + else
> >> + tmp_delta[i] = -1;
> >> + }
> >> +
> >> + for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
> >> + if (tmp_delta[i] < 0)
> >> + continue;
> >> +
> >> + if (tmp_delta[i] > my_stats[i].max) {
> >> + /* New max value */
> >> + my_stats[i].max = tmp_delta[i];
> >> + }
> >> + if (tmp_delta[i] < my_stats[i].min) {
> >> + /* New min value */
> >> + my_stats[i].min = tmp_delta[i];
> >> + }
> >> + }
> >> +
> >> + /* Update avg */
> >> + for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
> >> + unsigned long avg;
> >> +
> >> + if (tmp_delta[i] < 0)
> >> + continue;
> >> +
> >> + avg = my_stats[i].avg*my_stats[i].count + tmp_delta[i];
> >> +
> >> + my_stats[i].avg = avg / (my_stats[i].count + 1);
> >> +
> >> + my_stats[i].count++;
> >> + }
> >> +}
> >> +
> >> +static void psci_update_all_rt_instr_this_cpu(int cache)
> >> +{
> >> + if (!rt_track_enable)
> >> + return;
> >> +
> >> + psci_update_all_rt_instr(raw_smp_processor_id(), cache);
> >> +}
> >> +
> >> +/* Show latency summary all CPUs */
> >> +static int psci_summary_show(struct seq_file *s, void *data)
> >> +{
> >> + int cpu, i, ret = 0;
> >> +
> >> + for_each_online_cpu(cpu) {
> >> + struct rt_inst_stats *my_stats = per_cpu(rt_inst_summary, cpu);
> >> +
> >> + seq_printf(s, "CPU%-2d step min avg max count\n", cpu);
> >> +
> >> + for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
> >> + seq_printf(s, " %5s %7lu %7lu %7lu %7lu\n",
> >> + rt_inst_name[i],
> >> + (my_stats[i].min * rt_inst_mult) >> rt_inst_shift,
> >> + (my_stats[i].avg * rt_inst_mult) >> rt_inst_shift,
> >> + (my_stats[i].max * rt_inst_mult) >> rt_inst_shift,
> >> + my_stats[i].count);
> >
> > If we race with a CPU updating its stats, we could see junk here.
>
> we should not see junk but may be some partial results like updated
> value of min, max, avg updated but the count is not updated yet.
>
> >
> >> + }
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int psci_summary_open(struct inode *inode, struct file *file)
> >> +{
> >> + return single_open(file, psci_summary_show, NULL);
> >> +}
> >> +
> >> +static const struct file_operations psci_summary_fops = {
> >> + .open = psci_summary_open,
> >> + .read = seq_read,
> >> + .llseek = seq_lseek,
> >> + .release = single_release,
> >> +};
> >> +
> >> +
> >> +/* Enable tracking runtime stats*/
> >> +static int psci_enable_show(struct seq_file *s, void *data)
> >> +{
> >> + seq_printf(s, "%d\n", rt_track_enable);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int psci_enable_open(struct inode *inode, struct file *file)
> >> +{
> >> + return single_open(file, psci_enable_show, NULL);
> >> +}
> >> +
> >> +static ssize_t psci_enable_store(struct file *file,
> >> + const char __user *user_buf, size_t count, loff_t *ppos)
> >> +{
> >> + unsigned long val;
> >> + int i, cpu, ret;
> >> +
> >> + ret = kstrtoul_from_user(user_buf, count, 10, &val);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + rt_track_enable = !!val;
> >> +
> >> + if (rt_track_enable)
> >> + for_each_online_cpu(cpu) {
> >> + struct rt_inst_stats *my_stats = per_cpu(rt_inst_summary, cpu);
> >> +
> >> + for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
> >> + my_stats[i].min = -1;
> >> + my_stats[i].max = 0;
> >> + my_stats[i].avg = 0;
> >> + my_stats[i].count = 0;
> >> + }
> >> + }
> >> +
> >> + return count;
> >> +}
> >> +
> >> +static const struct file_operations psci_enable_fops = {
> >> + .open = psci_enable_open,
> >> + .write = psci_enable_store,
> >> + .read = seq_read,
> >> + .llseek = seq_lseek,
> >> + .release = single_release,
> >> +};
> >> +
> >> +/* Show current timestamp of all CPUs */
> >> +static int psci_current_show(struct seq_file *s, void *data)
> >> +{
> >> + int cpu, i, ret = 0;
> >> +
> >> + seq_puts(s, "CPU INSTR ID Timestamp\n");
> >> +
> >> + for_each_online_cpu(cpu)
> >> + for (i = 0; i < RT_INSTR_TOTAL_IDS; i++) {
> >> + unsigned long long ts;
> >> +
> >> + ts = (psci_get_one_rt_instr_cpu(i, cpu, 1) * rt_inst_mult)
> >> + >> rt_inst_shift,
> >> + seq_printf(s, "CPU%-2d INSTR%d timestamp %llu\n",
> >> + cpu, i, ts);
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int psci_current_open(struct inode *inode, struct file *file)
> >> +{
> >> + return single_open(file, psci_current_show, NULL);
> >> +}
> >> +
> >> +static const struct file_operations psci_current_fops = {
> >> + .open = psci_current_open,
> >> + .read = seq_read,
> >> + .llseek = seq_lseek,
> >> + .release = single_release,
> >> +};
> >> +
> >> +static int __init psci_debug_init(void)
> >> +{
> >> + u64 sec = GENMASK_ULL((56) - 1, 0);
> >
> > It's fine to write 55 directly.
>
> yes
>
> >
> > Please note the the timer is guaranteed to be *at least* 56 bits, not
> > *exactly* 56 bits, so you may need additional masking.
> >
> >> + u32 freq;
> >> +
> >> + /* Test if runtime instrumentation is enable */
> >> + if (!psci_rt_instr_enable()) {
> >
> > This is an unconditional initcall, so this simply is not safe.
>
> you're right, it should be called only if/when psci driver has been probed
>
> >
> > The platform may not have FW, or no secure world.
> >
> > On systems without a full SMCCC implementation (e.g. many PSCI 0.1
> > implementations), this will also be bad.
> >
> >> + pr_info("PSCI RunTime Instrumentation not supported\n");
> >> + return 0;
> >> + }
> >> +
> >> + /*
> >> + * PSCI instrumentation uses same timer ip as arm_arch_timer to count
> >> + * clock cycle so we reuse some clocksource helpers to translate cycle
> >> + * into ns
> >> + */
> >> +
> >> + /* Get counter frequency */
> >> + freq = read_sysreg(cntfrq_el0);
> >
> > On some platforms, CNTFRQ is wrong.
> >
> > I really don't like tying this into the arch timer driver.
>
> Not sure to catch your concern ? we don't touch the arch timer and in
> fact we don't even need it but only reuse some helpers function of
> clocksource to convert cycle into ns.
> On kernel side, we only get timestamps from the smc call and this
> timestamp are then converted in ns but we don't use anything directly
> from arm_arch_timer
>
> Regards,
> Vincent
>
> >
> > Thanks,
> > Mark.
> >
> >> +
> >> + /* Compute max run time */
> >> + do_div(sec, freq);
> >> +
> >> + /*
> >> + * Compute mul and shift values that will be used to convert cycle in
> >> + * ns thanks to the formula ns = (cyc * mult) >> shift
> >> + */
> >> + clocks_calc_mult_shift(&rt_inst_mult, &rt_inst_shift, freq,
> >> + NSEC_PER_SEC, sec);
> >> +
> >> + /* Create debugfs entries */
> >> + psci_debugfs_dir = debugfs_create_dir("psci", NULL);
> >> +
> >> + if (!psci_debugfs_dir)
> >> + return -ENOMEM;
> >> +
> >> + if (!debugfs_create_file("latency_current", 0444,
> >> + psci_debugfs_dir, NULL, &psci_current_fops))
> >> + goto error;
> >> +
> >> + if (!debugfs_create_file("enable", 0644,
> >> + psci_debugfs_dir, NULL, &psci_enable_fops))
> >> + goto error;
> >> +
> >> + if (!debugfs_create_file("latency_summary", 0444,
> >> + psci_debugfs_dir, NULL, &psci_summary_fops))
> >> + goto error;
> >> +
> >> + return 0;
> >> +
> >> +error:
> >> + debugfs_remove_recursive(psci_debugfs_dir);
> >> + return -ENOMEM;
> >> +}
> >> +late_initcall(psci_debug_init);
> >> +
> >> +static void __exit psci_debug_exit(void)
> >> +{
> >> + debugfs_remove_recursive(psci_debugfs_dir);
> >> +}
> >> +__exitcall(psci_debug_exit);
> >> +
> >> +#elif defined(CONFIG_CPU_IDLE)
> >> +static void psci_update_all_rt_instr_this_cpu(int cache) {}
> >> +#endif
> >> --
> >> 2.7.4
> >>
More information about the linux-arm-kernel
mailing list