[PATCH v8 1/4] cpufreq: Introduce an optional cpuinfo_avg_freq sysfs entry

Kai-Heng Feng kaihengf at nvidia.com
Mon Dec 16 00:33:28 PST 2024


Hi Sumit,

On 2024/12/16 3:11 PM, Sumit Gupta wrote:
> 
> 
> On 16/12/24 11:13, Kai-Heng Feng wrote:
>> Hi Beata,
>>
>> On 2024/12/6 9:55 PM, Beata Michalska wrote:
>>> Currently the CPUFreq core exposes two sysfs attributes that can be used
>>> to query current frequency of a given CPU(s): namely cpuinfo_cur_freq
>>> and scaling_cur_freq. Both provide slightly different view on the
>>> subject and they do come with their own drawbacks.
>>>
>>> cpuinfo_cur_freq provides higher precision though at a cost of being
>>> rather expensive. Moreover, the information retrieved via this attribute
>>> is somewhat short lived as frequency can change at any point of time
>>> making it difficult to reason from.
>>>
>>> scaling_cur_freq, on the other hand, tends to be less accurate but then
>>> the actual level of precision (and source of information) varies between
>>> architectures making it a bit ambiguous.
>>>
>>> The new attribute, cpuinfo_avg_freq, is intended to provide more stable,
>>> distinct interface, exposing an average frequency of a given CPU(s), as
>>> reported by the hardware, over a time frame spanning no more than a few
>>> milliseconds. As it requires appropriate hardware support, this
>>> interface is optional.
>>>
>>> Note that under the hood, the new attribute relies on the information
>>> provided by arch_freq_get_on_cpu, which, up to this point, has been
>>> feeding data for scaling_cur_freq attribute, being the source of
>>> ambiguity when it comes to interpretation. This has been amended by
>>> restoring the intended behavior for scaling_cur_freq, with a new
>>> dedicated config option to maintain status quo for those, who may need
>>> it.
>>>
>>> CC: Jonathan Corbet <corbet at lwn.net>
>>> CC: Thomas Gleixner <tglx at linutronix.de>
>>> CC: Ingo Molnar <mingo at redhat.com>
>>> CC: Borislav Petkov <bp at alien8.de>
>>> CC: Dave Hansen <dave.hansen at linux.intel.com>
>>> CC: H. Peter Anvin <hpa at zytor.com>
>>> CC: Phil Auld <pauld at redhat.com>
>>> CC: x86 at kernel.org
>>> CC: linux-doc at vger.kernel.org
>>> Signed-off-by: Beata Michalska <beata.michalska at arm.com>
>>> ---
>>>   Documentation/admin-guide/pm/cpufreq.rst | 16 ++++++++++-
>>>   arch/x86/kernel/cpu/aperfmperf.c         |  2 +-
>>>   arch/x86/kernel/cpu/proc.c               |  7 +++--
>>>   drivers/cpufreq/Kconfig.x86              | 12 ++++++++
>>>   drivers/cpufreq/cpufreq.c                | 36 +++++++++++++++++++++---
>>>   include/linux/cpufreq.h                  |  2 +-
>>>   6 files changed, 66 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin- 
>>> guide/pm/cpufreq.rst
>>> index fe1be4ad88cb..76f3835afe01 100644
>>> --- a/Documentation/admin-guide/pm/cpufreq.rst
>>> +++ b/Documentation/admin-guide/pm/cpufreq.rst
>>> @@ -248,6 +248,19 @@ are the following:
>>>       If that frequency cannot be determined, this attribute should not
>>>       be present.
>>> +``cpuinfo_avg_freq``
>>> +        An average frequency (in KHz) of all CPUs belonging to a given policy,
>>> +        derived from a hardware provided feedback and reported on a time frame
>>> +        spanning at most few milliseconds.
>>> +
>>> +        This is expected to be based on the frequency the hardware actually 
>>> runs
>>> +        at and, as such, might require specialised hardware support (such as 
>>> AMU
>>> +        extension on ARM). If one cannot be determined, this attribute should
>>> +        not be present.
>>> +
>>> +        Note, that failed attempt to retrieve current frequency for a given
>>> +        CPU(s) will result in an appropriate error.
>>> +
>>>   ``cpuinfo_max_freq``
>>>       Maximum possible operating frequency the CPUs belonging to this policy
>>>       can run at (in kHz).
>>> @@ -293,7 +306,8 @@ are the following:
>>>       Some architectures (e.g. ``x86``) may attempt to provide information
>>>       more precisely reflecting the current CPU frequency through this
>>>       attribute, but that still may not be the exact current CPU frequency as
>>> -    seen by the hardware at the moment.
>>> +    seen by the hardware at the moment. This behavior though, is only
>>> +    available via c:macro:``CPUFREQ_ARCH_CUR_FREQ`` option.
>>>   ``scaling_driver``
>>>       The scaling driver currently in use.
>>> diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
>>> index 0b69bfbf345d..a00059139ca4 100644
>>> --- a/arch/x86/kernel/cpu/aperfmperf.c
>>> +++ b/arch/x86/kernel/cpu/aperfmperf.c
>>> @@ -413,7 +413,7 @@ void arch_scale_freq_tick(void)
>>>    */
>>>   #define MAX_SAMPLE_AGE    ((unsigned long)HZ / 50)
>>> -unsigned int arch_freq_get_on_cpu(int cpu)
>>> +int arch_freq_get_on_cpu(int cpu)
>>>   {
>>>       struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
>>>       unsigned int seq, freq;
>>> diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
>>> index e65fae63660e..34d8fb93fb70 100644
>>> --- a/arch/x86/kernel/cpu/proc.c
>>> +++ b/arch/x86/kernel/cpu/proc.c
>>> @@ -86,9 +86,12 @@ static int show_cpuinfo(struct seq_file *m, void *v)
>>>           seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
>>>       if (cpu_has(c, X86_FEATURE_TSC)) {
>>> -        unsigned int freq = arch_freq_get_on_cpu(cpu);
>>> +        int freq = arch_freq_get_on_cpu(cpu);
>>> -        seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
>>> +        if (freq <= 0)
>>> +            seq_puts(m, "cpu MHz\t\t: Unknown\n");
>>> +        else
>>> +            seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 
>>> 1000));
>>>       }
>>>       /* Cache size */
>>> diff --git a/drivers/cpufreq/Kconfig.x86 b/drivers/cpufreq/Kconfig.x86
>>> index 97c2d4f15d76..212e1b9afe21 100644
>>> --- a/drivers/cpufreq/Kconfig.x86
>>> +++ b/drivers/cpufreq/Kconfig.x86
>>> @@ -340,3 +340,15 @@ config X86_SPEEDSTEP_RELAXED_CAP_CHECK
>>>         option lets the probing code bypass some of those checks if the
>>>         parameter "relaxed_check=1" is passed to the module.
>>> +config CPUFREQ_ARCH_CUR_FREQ
>>> +    default y
>>> +    bool "Current frequency derived from HW provided feedback"
>>> +    help
>>> +      This determines whether the scaling_cur_freq sysfs attribute returns
>>> +      the last requested frequency or a more precise value based on hardware
>>> +      provided feedback (as architected counters).
>>> +      Given that a more precise frequency can now be provided via the
>>> +      cpuinfo_avg_cur_freq attribute, by enabling this option,
>>> +      scaling_cur_freq maintains the provision of a counter based frequency,
>>> +      for compatibility reasons.
>>> +
>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>> index 04fc786dd2c0..70df2a24437b 100644
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -747,9 +747,14 @@ show_one(cpuinfo_transition_latency, 
>>> cpuinfo.transition_latency);
>>>   show_one(scaling_min_freq, min);
>>>   show_one(scaling_max_freq, max);
>>> -__weak unsigned int arch_freq_get_on_cpu(int cpu)
>>> +__weak int arch_freq_get_on_cpu(int cpu)
>>>   {
>>> -    return 0;
>>> +    return -EOPNOTSUPP;
>>> +}
>>> +
>>> +static inline bool cpufreq_avg_freq_supported(struct cpufreq_policy *policy)
>>> +{
>>> +    return arch_freq_get_on_cpu(policy->cpu) != -EOPNOTSUPP;
>>>   }
>>>   static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>>> @@ -757,8 +762,11 @@ static ssize_t show_scaling_cur_freq(struct 
>>> cpufreq_policy *policy, char *buf)
>>>       ssize_t ret;
>>>       unsigned int freq;
>>> -    freq = arch_freq_get_on_cpu(policy->cpu);
>>> -    if (freq)
>>> +    freq = IS_ENABLED(CONFIG_CPUFREQ_ARCH_CUR_FREQ)
>>> +        ? arch_freq_get_on_cpu(policy->cpu)
>>> +        : 0;
>>> +
>>> +    if (freq > 0)
>>>           ret = sysfs_emit(buf, "%u\n", freq);
>>>       else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
>>>           ret = sysfs_emit(buf, "%u\n", cpufreq_driver->get(policy->cpu));
>>> @@ -802,6 +810,19 @@ static ssize_t show_cpuinfo_cur_freq(struct 
>>> cpufreq_policy *policy,
>>>       return sysfs_emit(buf, "<unknown>\n");
>>>   }
>>> +/*
>>> + * show_cpuinfo_avg_freq - average CPU frequency as detected by hardware
>>> + */
>>> +static ssize_t show_cpuinfo_avg_freq(struct cpufreq_policy *policy,
>>> +                     char *buf)
>>> +{
>>> +    int avg_freq = arch_freq_get_on_cpu(policy->cpu);
>>
>> We are seeing issues when reading cpuinfo_avg_freq on an ARM64 system:
>>
>> $ cat /sys/devices/system/cpu/cpufreq/policy1/cpuinfo_avg_freq
>> cat: /sys/devices/system/cpu/cpufreq/policy1/cpuinfo_avg_freq: Resource 
>> temporarily unavailable
>>
>> The CPU is in idle state, so arch_freq_get_on_cpu() can't find a good 
>> alternative source for frequency info.
>>
> 
> Hi Kai Heng,
> This has already been discussed during v7 in [1] & [2].

Thanks for the info!

> In v7, we were returning zero which printed 'unknown'.
> The discussion was about printing in more descriptive way or with an appropriate 
> error code. In v8 we are returning 'EAGAIN' instead of zero. The final decision 
> was of Maintainers.

Is there any cpufreq driver that prints "unknown" or error when CPU is in idle?

I think it's more unsurprising to print the lowest CPU frequency when CPU is in 
idle state, instead of any other error code.

Kai-Heng

> 
> Viresh,
> You have any preference on this?
> 
> [1] https://lore.kernel.org/lkml/aa254516-968e-4665-bb5b-981c296ffc35@nvidia.com/#t
> [2] https://lore.kernel.org/lkml/Zyh-uVSW-0d0r8oB@arm.com/
> 
> Thank you,
> Sumit Gupta
> 
>> One way to resolve this is to have fallback methods in show_cpuinfo_avg_freq() 
>> so it will look like this:
>>
>> static ssize_t show_cpuinfo_avg_freq(struct cpufreq_policy *policy,
>>                                       char *buf)
>> {
>>          int avg_freq = arch_freq_get_on_cpu(policy->cpu);
>>          int ret;
>>
>>          if (avg_freq > 0)
>>                  ret = sysfs_emit(buf, "%u\n", avg_freq);
>>          else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
>>                  ret = sysfs_emit(buf, "%u\n", cpufreq_driver->get(policy->cpu));
>>          else
>>                  ret = sysfs_emit(buf, "%u\n", policy->cur);
>>          return ret;
>> }
>>
>> But that also makes show_cpuinfo_avg_freq() pretty much the same as 
>> show_scaling_cur_freq().
>>
>> So is it possible to consolidate show_cpuinfo_avg_freq() into 
>> show_scaling_cur_freq(), by making CONFIG_CPUFREQ_ARCH_CUR_FREQ also available 
>> to ARM64?
>>
>> Kai-Heng
>>
>>> +
>>> +    if (avg_freq > 0)
>>> +        return sysfs_emit(buf, "%u\n", avg_freq);
>>> +    return avg_freq != 0 ? avg_freq : -EINVAL;
>>> +}
>>> +
>>>   /*
>>>    * show_scaling_governor - show the current policy for the specified CPU
>>>    */
>>> @@ -964,6 +985,7 @@ static ssize_t show_bios_limit(struct cpufreq_policy 
>>> *policy, char *buf)
>>>   }
>>>   cpufreq_freq_attr_ro_perm(cpuinfo_cur_freq, 0400);
>>> +cpufreq_freq_attr_ro(cpuinfo_avg_freq);
>>>   cpufreq_freq_attr_ro(cpuinfo_min_freq);
>>>   cpufreq_freq_attr_ro(cpuinfo_max_freq);
>>>   cpufreq_freq_attr_ro(cpuinfo_transition_latency);
>>> @@ -1091,6 +1113,12 @@ static int cpufreq_add_dev_interface(struct 
>>> cpufreq_policy *policy)
>>>               return ret;
>>>       }
>>> +    if (cpufreq_avg_freq_supported(policy)) {
>>> +        ret = sysfs_create_file(&policy->kobj, &cpuinfo_avg_freq.attr);
>>> +        if (ret)
>>> +            return ret;
>>> +    }
>>> +
>>>       ret = sysfs_create_file(&policy->kobj, &scaling_cur_freq.attr);
>>>       if (ret)
>>>           return ret;
>>> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
>>> index d4d2f4d1d7cb..a7b6c0ccf9bc 100644
>>> --- a/include/linux/cpufreq.h
>>> +++ b/include/linux/cpufreq.h
>>> @@ -1194,7 +1194,7 @@ static inline int 
>>> of_perf_domain_get_sharing_cpumask(int pcpu, const char *list_
>>>   }
>>>   #endif
>>> -extern unsigned int arch_freq_get_on_cpu(int cpu);
>>> +extern int arch_freq_get_on_cpu(int cpu);
>>>   #ifndef arch_set_freq_scale
>>>   static __always_inline
>>




More information about the linux-arm-kernel mailing list