[PATCH] psci: add debugfs for runtime instrumentation

Vincent Guittot vincent.guittot at linaro.org
Tue Jun 20 01:08:42 PDT 2017


On 19 June 2017 at 13:25, Lorenzo Pieralisi <lorenzo.pieralisi at arm.com> wrote:
> On Fri, Jun 16, 2017 at 04:26:20PM +0200, Vincent Guittot wrote:
>> Hi Mark,
>>

[ snip]

>> >> +config ARM_PSCI_INSTRUMENTATION
>> >> +     bool "ARM PSCI Run Time Instrumentation"
>> >> +     depends on ARM_PSCI_FW && CPU_IDLE && DEBUG_FS
>> >> +     help
>> >> +       Enable debugfs interface for PSCI FW in which you can retrieve
>> >> +       runtime instrumentation values of the PSCI firmware like time to
>> >> +       enter or leave a cpu suspend state or time to flush cache.
>> >
>> > Looking at the latest published spec (ARM DEN 0022D), PSCI doesn't
>> > provide such information, as there's only PSCI_STAT_{COUNT,RESIDENCY}.
>>
>> This use the same Performance Measurement Framework as
>> PSCI_STAT_{COUNT,RESIDENCY}
>
> Except that as Mark pointed out PSCI_STAT_{COUNT, RESIDENCY} are
> standard calls and part of the PSCI specification and the rest of the
> PMF is implementation defined firmware that is mostly useful for
> debugging...firmware :) so why do we even have to bother adding a kernel
> debugs layer to read those stats ?

These values are really interesting when profiling the idle path of the CPUs

Thanks
>
> Anyway, I am happy to implement PSCI v1.0 calls to retrieve
> PSCI_STAT_{COUNT, RESIDENCY} statistics to add to CPU_IDLE kernel
> stats, as for this patch it does not implement anything that is part of
> the PSCI specification so there is not much else I can comment on I am
> afraid.
>
> Thanks,
> Lorenzo
>
>> > What mechanisms are you using here?
>>
>> More details can be founded here
>> https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/firmware-design.md#13--performance-measurement-framework
>>
>> >
>> > As far as I can tell, these are not part of the PSCI standard, and I'm
>> > not aware of any documentation.
>> >
>> > I'm very much not keen on using IMPLEMENTATION DEFINED functionality in
>> > the PSCI driver, as IMP DEF stuff always creates long-term maintenance
>> > problems. You'll need to convince me that this is worth supporting.
>>
>> I will try.
>>
>> I have put it in psci.c as it is really about measurement latenccy of
>> psci firmware but it can probably be put somewhere else if it is more
>> relevant. The only link between current psci driver and this feature
>> is the call to psci_update_all_rt_instr() in psci_cpu_suspend_enter()
>> to make sure that we have coherent timestamp
>> Thinking a bit more about this, we can probably use the cpu_pm_exit
>>
>> >
>> > [...]
>> >
>> >> +#ifdef CONFIG_ARM_PSCI_INSTRUMENTATION
>> >> +#include <linux/clocksource.h>
>> >> +#include <linux/debugfs.h>
>> >> +#include <linux/seq_file.h>
>> >> +
>> >> +#define PMF_SMC_GET_TIMESTAMP_64     0xC2000010
>> >
>> > What is PMF?
>>
>> Performance Monitoring Framework
>>
>> >
>> > Per the SMCCC, this ID is an atomic {SMC,HVC}64 SiP call.
>> >
>> > I'm really not keen on using IMP DEF SiP calls like this. These are not
>> > part of PSCI.
>> >
>> >> +
>> >> +#define PMF_ARM_TIF_IMPL_ID          0x41
>> >> +
>> >> +#define PMF_RT_INSTR_SVC_ID          1
>> >> +
>> >> +#define RT_INSTR_ENTER_PSCI          0
>> >> +#define RT_INSTR_EXIT_PSCI           1
>> >> +#define RT_INSTR_ENTER_HW_LOW_PWR    2
>> >> +#define RT_INSTR_EXIT_HW_LOW_PWR     3
>> >> +#define RT_INSTR_ENTER_CFLUSH                4
>> >> +#define RT_INSTR_EXIT_CFLUSH         5
>> >> +#define RT_INSTR_TOTAL_IDS           6
>> >> +
>> >> +#define PMF_TID_SHIFT                        0
>> >> +#define PMF_TID_MASK                 (0xFF << PMF_TID_SHIFT)
>> >> +#define PMF_SVC_ID_SHIFT             10
>> >> +#define PMF_SVC_ID_MASK                      (0x3F << PMF_SVC_ID_SHIFT)
>> >> +#define PMF_IMPL_ID_SHIFT            24
>> >> +#define PMF_IMPL_ID_MASK             (0xFFU << PMF_IMPL_ID_SHIFT)
>> >> +
>> >> +struct rt_inst_stats {
>> >> +     unsigned long min;
>> >> +     unsigned long max;
>> >> +     unsigned long avg;
>> >> +     unsigned long count;
>> >> +};
>> >> +
>> >> +static struct dentry *psci_debugfs_dir;
>> >> +
>> >> +/* meaningful instrumentation name */
>> >> +char *rt_inst_name[RT_INSTR_TOTAL_IDS/2] = {
>> >> +     "enter",
>> >> +     "leave",
>> >> +     "cache"
>> >> +};
>> >> +
>> >> +/* timstamp index to use to compute the delta */
>> >> +int rt_inst_delta_idx[RT_INSTR_TOTAL_IDS/2][2] = {
>> >> +     /* Start -->                    End */
>> >> +     {RT_INSTR_ENTER_PSCI,           RT_INSTR_ENTER_HW_LOW_PWR},
>> >> +     {RT_INSTR_EXIT_HW_LOW_PWR,      RT_INSTR_EXIT_PSCI},
>> >> +     {RT_INSTR_ENTER_CFLUSH,         RT_INSTR_EXIT_CFLUSH}
>> >> +};
>> >> +
>> >> +DEFINE_PER_CPU(struct rt_inst_stats[RT_INSTR_TOTAL_IDS/2], rt_inst_summary);
>> >> +int rt_track_enable;
>> >> +
>> >> +u32 rt_inst_shift;
>> >> +u32 rt_inst_mult = 1;
>> >> +
>> >> +static bool psci_rt_instr_enable(void)
>> >> +{
>> >> +     int ret, tid;
>> >> +     int cpu_mpidr = cpu_logical_map(0);
>> >> +     tid |= PMF_RT_INSTR_SVC_ID << PMF_SVC_ID_SHIFT | RT_INSTR_ENTER_PSCI;
>> >
>> > What exactly is this tid parameter?
>>
>> This tid selects which timestamp to get among the list
>>
>> >
>> >> +
>> >> +     arm_smccc_smc(PMF_SMC_GET_TIMESTAMP_64, tid, cpu_mpidr,
>> >> +                     0, 0, 0, 0, 0, &res);
>> >> +
>> >> +     ret = psci_to_linux_errno(res.a0);
>> >
>> > The above was not a PSCI call. Does it follow the same return
>> > conventions?
>>
>> No, psci smc call function only returns a0. In this case of runtime
>> instrumentation, a0 is the error code and a1 is the timestamp
>>
>> >
>> > How big's the returnned timestamp?
>>
>> it's a 64bit value
>>
>> >
>> > How's a big value disambiguated from a negative error code?
>>
>> a0 is error code and a1 is the timestamp. the function
>> psci_rt_instr_enable() just try to make get a timestamp to check that
>> the service is supported by the firmwareso in this case we don't care
>> of a1 but we do in psci_get_one_rt_instr_cpu
>>
>> >
>> >> +
>> >> +     if (ret)
>> >> +             return false;
>> >> +
>> >> +     return true;
>> >> +}
>> >> +
>> >> +static long long psci_get_one_rt_instr_cpu(int id, int cpu, int cache)
>> >> +{
>> >> +     int ret, tid;
>> >> +     int cpu_mpidr = cpu_logical_map(cpu);
>> >> +     struct arm_smccc_res res;
>> >> +
>> >> +     tid = PMF_ARM_TIF_IMPL_ID << PMF_IMPL_ID_SHIFT;
>> >> +     tid |= PMF_RT_INSTR_SVC_ID << PMF_SVC_ID_SHIFT | id;
>> >> +
>> >> +     arm_smccc_smc(PMF_SMC_GET_TIMESTAMP_64, tid, cpu_mpidr, cache,
>> >> +                     0, 0, 0, 0, &res);
>> >
>> > What exaclty does this "cache" parameter do?
>>
>> When set, this parameter forces a dcache invalidate in the PSCI
>> firmware before reading the timestamp
>>
>> >
>> >> +
>> >> +     ret = psci_to_linux_errno(res.a0);
>> >> +
>> >> +     if (ret)
>> >> +             return (long long)(ret);
>> >
>> > Surely the implicit conversion to the return type is sufficient?
>>
>> yes probably. I tend to explicitly cast in order to make the
>> conversion easier to identify when reading the code
>>
>> >
>> >> +
>> >> +     return res.a1;
>> >> +}
>> >> +
>> >> +static int psci_get_all_rt_instr_cpu(int cpu, int cache, long long timestamp[])
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     for (i = 0; i < RT_INSTR_TOTAL_IDS; i++) {
>> >> +             timestamp[i] = psci_get_one_rt_instr_cpu(i, cpu, cache);
>> >> +             if (timestamp[i] < 0)
>> >> +                     return (int)(timestamp[i]);
>> >> +     }
>> >> +
>> >> +     return 0;
>> >> +}
>> >> +
>> >> +static void psci_update_all_rt_instr(int cpu, int cache)
>> >> +{
>> >> +     long long tmp_rt_inst[RT_INSTR_TOTAL_IDS];
>> >> +     long tmp_delta[RT_INSTR_TOTAL_IDS/2];
>> >> +     struct rt_inst_stats *my_stats = per_cpu(rt_inst_summary, cpu);
>> >> +     long long start_ts;
>> >> +     int i;
>> >> +
>> >> +     /* Get all RT statistics of the cpu*/
>> >> +     if (psci_get_all_rt_instr_cpu(cpu, cache, tmp_rt_inst))
>> >> +             /* An error happens while getting timestamp */
>> >> +             return;
>> >> +
>> >> +     /* Get start time */
>> >> +     start_ts = tmp_rt_inst[0];
>> >> +
>> >> +     /* Compute delta time */
>> >> +     for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
>> >> +             long long start = tmp_rt_inst[rt_inst_delta_idx[i][0]];
>> >> +             long long end = tmp_rt_inst[rt_inst_delta_idx[i][1]];
>> >> +
>> >> +             /*
>> >> +              * If time is before start time, discard the value which come
>> >> +              * from a previous call
>> >> +              */
>> >
>> > I don't follow this comment.
>>
>> start_ts = tmp_rt_inst[0] is the first timestamp of a PSCI call so if
>> another timestamp is before, this probably mean that it comes from
>> another call like cache maintenance timestamp not updated for this
>> psci call
>>
>> I will update the comment to make it clearer
>>
>> >
>> >> +             if ((start >= start_ts) && (end >= start))
>> >> +                     tmp_delta[i] = end - start;
>> >> +             else
>> >> +                     tmp_delta[i] = -1;
>> >> +     }
>> >> +
>> >> +     for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
>> >> +             if (tmp_delta[i] < 0)
>> >> +                     continue;
>> >> +
>> >> +             if (tmp_delta[i] > my_stats[i].max) {
>> >> +                     /* New max value */
>> >> +                     my_stats[i].max = tmp_delta[i];
>> >> +             }
>> >> +             if (tmp_delta[i] < my_stats[i].min) {
>> >> +                     /* New min value */
>> >> +                     my_stats[i].min = tmp_delta[i];
>> >> +             }
>> >> +     }
>> >> +
>> >> +     /* Update avg */
>> >> +     for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
>> >> +             unsigned long avg;
>> >> +
>> >> +             if (tmp_delta[i] < 0)
>> >> +                     continue;
>> >> +
>> >> +             avg = my_stats[i].avg*my_stats[i].count + tmp_delta[i];
>> >> +
>> >> +             my_stats[i].avg = avg / (my_stats[i].count + 1);
>> >> +
>> >> +             my_stats[i].count++;
>> >> +     }
>> >> +}
>> >> +
>> >> +static void psci_update_all_rt_instr_this_cpu(int cache)
>> >> +{
>> >> +     if (!rt_track_enable)
>> >> +             return;
>> >> +
>> >> +     psci_update_all_rt_instr(raw_smp_processor_id(), cache);
>> >> +}
>> >> +
>> >> +/* Show latency summary all CPUs */
>> >> +static int psci_summary_show(struct seq_file *s, void *data)
>> >> +{
>> >> +     int cpu, i, ret = 0;
>> >> +
>> >> +     for_each_online_cpu(cpu) {
>> >> +             struct rt_inst_stats *my_stats = per_cpu(rt_inst_summary, cpu);
>> >> +
>> >> +             seq_printf(s, "CPU%-2d  step      min     avg     max   count\n", cpu);
>> >> +
>> >> +             for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
>> >> +                     seq_printf(s, "       %5s %7lu %7lu %7lu %7lu\n",
>> >> +                             rt_inst_name[i],
>> >> +                             (my_stats[i].min * rt_inst_mult) >> rt_inst_shift,
>> >> +                             (my_stats[i].avg * rt_inst_mult) >> rt_inst_shift,
>> >> +                             (my_stats[i].max * rt_inst_mult) >> rt_inst_shift,
>> >> +                             my_stats[i].count);
>> >
>> > If we race with a CPU updating its stats, we could see junk here.
>>
>> we should not see junk but may be some partial results like updated
>> value of min, max, avg updated but the count is not updated yet.
>>
>> >
>> >> +             }
>> >> +     }
>> >> +
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +static int psci_summary_open(struct inode *inode, struct file *file)
>> >> +{
>> >> +     return single_open(file, psci_summary_show, NULL);
>> >> +}
>> >> +
>> >> +static const struct file_operations psci_summary_fops = {
>> >> +     .open = psci_summary_open,
>> >> +     .read = seq_read,
>> >> +     .llseek = seq_lseek,
>> >> +     .release = single_release,
>> >> +};
>> >> +
>> >> +
>> >> +/* Enable tracking runtime stats*/
>> >> +static int psci_enable_show(struct seq_file *s, void *data)
>> >> +{
>> >> +     seq_printf(s, "%d\n", rt_track_enable);
>> >> +
>> >> +     return 0;
>> >> +}
>> >> +
>> >> +static int psci_enable_open(struct inode *inode, struct file *file)
>> >> +{
>> >> +     return single_open(file, psci_enable_show, NULL);
>> >> +}
>> >> +
>> >> +static ssize_t psci_enable_store(struct file *file,
>> >> +     const char __user *user_buf, size_t count, loff_t *ppos)
>> >> +{
>> >> +     unsigned long val;
>> >> +     int i, cpu, ret;
>> >> +
>> >> +     ret = kstrtoul_from_user(user_buf, count, 10, &val);
>> >> +     if (ret)
>> >> +             return ret;
>> >> +
>> >> +     rt_track_enable = !!val;
>> >> +
>> >> +     if (rt_track_enable)
>> >> +             for_each_online_cpu(cpu) {
>> >> +                     struct rt_inst_stats *my_stats = per_cpu(rt_inst_summary, cpu);
>> >> +
>> >> +                     for (i = 0; i < RT_INSTR_TOTAL_IDS/2; i++) {
>> >> +                             my_stats[i].min = -1;
>> >> +                             my_stats[i].max = 0;
>> >> +                             my_stats[i].avg = 0;
>> >> +                             my_stats[i].count = 0;
>> >> +                     }
>> >> +             }
>> >> +
>> >> +     return count;
>> >> +}
>> >> +
>> >> +static const struct file_operations psci_enable_fops = {
>> >> +     .open = psci_enable_open,
>> >> +     .write = psci_enable_store,
>> >> +     .read = seq_read,
>> >> +     .llseek = seq_lseek,
>> >> +     .release = single_release,
>> >> +};
>> >> +
>> >> +/* Show current timestamp of all CPUs */
>> >> +static int psci_current_show(struct seq_file *s, void *data)
>> >> +{
>> >> +     int cpu, i, ret = 0;
>> >> +
>> >> +     seq_puts(s, "CPU  INSTR ID  Timestamp\n");
>> >> +
>> >> +     for_each_online_cpu(cpu)
>> >> +             for (i = 0; i < RT_INSTR_TOTAL_IDS; i++) {
>> >> +                     unsigned long long ts;
>> >> +
>> >> +                     ts = (psci_get_one_rt_instr_cpu(i, cpu, 1) * rt_inst_mult)
>> >> +                                     >> rt_inst_shift,
>> >> +                     seq_printf(s, "CPU%-2d  INSTR%d timestamp %llu\n",
>> >> +                                     cpu, i, ts);
>> >> +             }
>> >> +
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +static int psci_current_open(struct inode *inode, struct file *file)
>> >> +{
>> >> +     return single_open(file, psci_current_show, NULL);
>> >> +}
>> >> +
>> >> +static const struct file_operations psci_current_fops = {
>> >> +     .open = psci_current_open,
>> >> +     .read = seq_read,
>> >> +     .llseek = seq_lseek,
>> >> +     .release = single_release,
>> >> +};
>> >> +
>> >> +static int __init psci_debug_init(void)
>> >> +{
>> >> +     u64 sec = GENMASK_ULL((56) - 1, 0);
>> >
>> > It's fine to write 55 directly.
>>
>> yes
>>
>> >
>> > Please note the the timer is guaranteed to be *at least* 56 bits, not
>> > *exactly* 56 bits, so you may need additional masking.
>> >
>> >> +     u32 freq;
>> >> +
>> >> +     /* Test if runtime instrumentation is enable */
>> >> +     if (!psci_rt_instr_enable()) {
>> >
>> > This is an unconditional initcall, so this simply is not safe.
>>
>> you're right, it should be called only if/when psci driver has been probed
>>
>> >
>> > The platform may not have FW, or no secure world.
>> >
>> > On systems without a full SMCCC implementation (e.g. many PSCI 0.1
>> > implementations), this will also be bad.
>> >
>> >> +             pr_info("PSCI RunTime Instrumentation not supported\n");
>> >> +             return 0;
>> >> +     }
>> >> +
>> >> +     /*
>> >> +      * PSCI instrumentation uses same timer ip as arm_arch_timer to count
>> >> +      * clock cycle so we reuse some clocksource helpers to translate cycle
>> >> +      * into ns
>> >> +      */
>> >> +
>> >> +     /* Get counter frequency */
>> >> +     freq = read_sysreg(cntfrq_el0);
>> >
>> > On some platforms, CNTFRQ is wrong.
>> >
>> > I really don't like tying this into the arch timer driver.
>>
>> Not sure to catch your concern ? we don't touch the arch timer and in
>> fact we don't even need it but only reuse some helpers function of
>> clocksource to convert cycle into ns.
>> On kernel side, we only get timestamps from the smc call and this
>> timestamp are then converted in ns but we don't use anything directly
>> from arm_arch_timer
>>
>> Regards,
>> Vincent
>>
>> >
>> > Thanks,
>> > Mark.
>> >
>> >> +
>> >> +     /* Compute max run time */
>> >> +     do_div(sec, freq);
>> >> +
>> >> +     /*
>> >> +      * Compute mul and shift values that will be used to convert cycle in
>> >> +      * ns thanks to the formula ns = (cyc * mult) >> shift
>> >> +      */
>> >> +     clocks_calc_mult_shift(&rt_inst_mult, &rt_inst_shift, freq,
>> >> +                                    NSEC_PER_SEC, sec);
>> >> +
>> >> +     /* Create debugfs entries */
>> >> +     psci_debugfs_dir = debugfs_create_dir("psci", NULL);
>> >> +
>> >> +     if (!psci_debugfs_dir)
>> >> +             return -ENOMEM;
>> >> +
>> >> +     if (!debugfs_create_file("latency_current", 0444,
>> >> +                     psci_debugfs_dir, NULL, &psci_current_fops))
>> >> +             goto error;
>> >> +
>> >> +     if (!debugfs_create_file("enable", 0644,
>> >> +                     psci_debugfs_dir, NULL, &psci_enable_fops))
>> >> +             goto error;
>> >> +
>> >> +     if (!debugfs_create_file("latency_summary", 0444,
>> >> +                     psci_debugfs_dir, NULL, &psci_summary_fops))
>> >> +             goto error;
>> >> +
>> >> +     return 0;
>> >> +
>> >> +error:
>> >> +     debugfs_remove_recursive(psci_debugfs_dir);
>> >> +     return -ENOMEM;
>> >> +}
>> >> +late_initcall(psci_debug_init);
>> >> +
>> >> +static void __exit psci_debug_exit(void)
>> >> +{
>> >> +     debugfs_remove_recursive(psci_debugfs_dir);
>> >> +}
>> >> +__exitcall(psci_debug_exit);
>> >> +
>> >> +#elif defined(CONFIG_CPU_IDLE)
>> >> +static void psci_update_all_rt_instr_this_cpu(int cache) {}
>> >> +#endif
>> >> --
>> >> 2.7.4
>> >>



More information about the linux-arm-kernel mailing list