[PATCH v8 1/4] cpufreq: Introduce an optional cpuinfo_avg_freq sysfs entry

Kai-Heng Feng kaihengf at nvidia.com
Sun Dec 15 21:43:02 PST 2024


Hi Beata,

On 2024/12/6 9:55 PM, Beata Michalska wrote:
> Currently the CPUFreq core exposes two sysfs attributes that can be used
> to query current frequency of a given CPU(s): namely cpuinfo_cur_freq
> and scaling_cur_freq. Both provide slightly different view on the
> subject and they do come with their own drawbacks.
> 
> cpuinfo_cur_freq provides higher precision though at a cost of being
> rather expensive. Moreover, the information retrieved via this attribute
> is somewhat short lived as frequency can change at any point of time
> making it difficult to reason from.
> 
> scaling_cur_freq, on the other hand, tends to be less accurate but then
> the actual level of precision (and source of information) varies between
> architectures making it a bit ambiguous.
> 
> The new attribute, cpuinfo_avg_freq, is intended to provide more stable,
> distinct interface, exposing an average frequency of a given CPU(s), as
> reported by the hardware, over a time frame spanning no more than a few
> milliseconds. As it requires appropriate hardware support, this
> interface is optional.
> 
> Note that under the hood, the new attribute relies on the information
> provided by arch_freq_get_on_cpu, which, up to this point, has been
> feeding data for scaling_cur_freq attribute, being the source of
> ambiguity when it comes to interpretation. This has been amended by
> restoring the intended behavior for scaling_cur_freq, with a new
> dedicated config option to maintain status quo for those, who may need
> it.
> 
> CC: Jonathan Corbet <corbet at lwn.net>
> CC: Thomas Gleixner <tglx at linutronix.de>
> CC: Ingo Molnar <mingo at redhat.com>
> CC: Borislav Petkov <bp at alien8.de>
> CC: Dave Hansen <dave.hansen at linux.intel.com>
> CC: H. Peter Anvin <hpa at zytor.com>
> CC: Phil Auld <pauld at redhat.com>
> CC: x86 at kernel.org
> CC: linux-doc at vger.kernel.org
> Signed-off-by: Beata Michalska <beata.michalska at arm.com>
> ---
>   Documentation/admin-guide/pm/cpufreq.rst | 16 ++++++++++-
>   arch/x86/kernel/cpu/aperfmperf.c         |  2 +-
>   arch/x86/kernel/cpu/proc.c               |  7 +++--
>   drivers/cpufreq/Kconfig.x86              | 12 ++++++++
>   drivers/cpufreq/cpufreq.c                | 36 +++++++++++++++++++++---
>   include/linux/cpufreq.h                  |  2 +-
>   6 files changed, 66 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
> index fe1be4ad88cb..76f3835afe01 100644
> --- a/Documentation/admin-guide/pm/cpufreq.rst
> +++ b/Documentation/admin-guide/pm/cpufreq.rst
> @@ -248,6 +248,19 @@ are the following:
>   	If that frequency cannot be determined, this attribute should not
>   	be present.
>   
> +``cpuinfo_avg_freq``
> +        An average frequency (in KHz) of all CPUs belonging to a given policy,
> +        derived from a hardware provided feedback and reported on a time frame
> +        spanning at most few milliseconds.
> +
> +        This is expected to be based on the frequency the hardware actually runs
> +        at and, as such, might require specialised hardware support (such as AMU
> +        extension on ARM). If one cannot be determined, this attribute should
> +        not be present.
> +
> +        Note, that failed attempt to retrieve current frequency for a given
> +        CPU(s) will result in an appropriate error.
> +
>   ``cpuinfo_max_freq``
>   	Maximum possible operating frequency the CPUs belonging to this policy
>   	can run at (in kHz).
> @@ -293,7 +306,8 @@ are the following:
>   	Some architectures (e.g. ``x86``) may attempt to provide information
>   	more precisely reflecting the current CPU frequency through this
>   	attribute, but that still may not be the exact current CPU frequency as
> -	seen by the hardware at the moment.
> +	seen by the hardware at the moment. This behavior though, is only
> +	available via c:macro:``CPUFREQ_ARCH_CUR_FREQ`` option.
>   
>   ``scaling_driver``
>   	The scaling driver currently in use.
> diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
> index 0b69bfbf345d..a00059139ca4 100644
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -413,7 +413,7 @@ void arch_scale_freq_tick(void)
>    */
>   #define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)
>   
> -unsigned int arch_freq_get_on_cpu(int cpu)
> +int arch_freq_get_on_cpu(int cpu)
>   {
>   	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
>   	unsigned int seq, freq;
> diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
> index e65fae63660e..34d8fb93fb70 100644
> --- a/arch/x86/kernel/cpu/proc.c
> +++ b/arch/x86/kernel/cpu/proc.c
> @@ -86,9 +86,12 @@ static int show_cpuinfo(struct seq_file *m, void *v)
>   		seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
>   
>   	if (cpu_has(c, X86_FEATURE_TSC)) {
> -		unsigned int freq = arch_freq_get_on_cpu(cpu);
> +		int freq = arch_freq_get_on_cpu(cpu);
>   
> -		seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
> +		if (freq <= 0)
> +			seq_puts(m, "cpu MHz\t\t: Unknown\n");
> +		else
> +			seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
>   	}
>   
>   	/* Cache size */
> diff --git a/drivers/cpufreq/Kconfig.x86 b/drivers/cpufreq/Kconfig.x86
> index 97c2d4f15d76..212e1b9afe21 100644
> --- a/drivers/cpufreq/Kconfig.x86
> +++ b/drivers/cpufreq/Kconfig.x86
> @@ -340,3 +340,15 @@ config X86_SPEEDSTEP_RELAXED_CAP_CHECK
>   	  option lets the probing code bypass some of those checks if the
>   	  parameter "relaxed_check=1" is passed to the module.
>   
> +config CPUFREQ_ARCH_CUR_FREQ
> +	default y
> +	bool "Current frequency derived from HW provided feedback"
> +	help
> +	  This determines whether the scaling_cur_freq sysfs attribute returns
> +	  the last requested frequency or a more precise value based on hardware
> +	  provided feedback (as architected counters).
> +	  Given that a more precise frequency can now be provided via the
> +	  cpuinfo_avg_cur_freq attribute, by enabling this option,
> +	  scaling_cur_freq maintains the provision of a counter based frequency,
> +	  for compatibility reasons.
> +
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 04fc786dd2c0..70df2a24437b 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -747,9 +747,14 @@ show_one(cpuinfo_transition_latency, cpuinfo.transition_latency);
>   show_one(scaling_min_freq, min);
>   show_one(scaling_max_freq, max);
>   
> -__weak unsigned int arch_freq_get_on_cpu(int cpu)
> +__weak int arch_freq_get_on_cpu(int cpu)
>   {
> -	return 0;
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline bool cpufreq_avg_freq_supported(struct cpufreq_policy *policy)
> +{
> +	return arch_freq_get_on_cpu(policy->cpu) != -EOPNOTSUPP;
>   }
>   
>   static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> @@ -757,8 +762,11 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>   	ssize_t ret;
>   	unsigned int freq;
>   
> -	freq = arch_freq_get_on_cpu(policy->cpu);
> -	if (freq)
> +	freq = IS_ENABLED(CONFIG_CPUFREQ_ARCH_CUR_FREQ)
> +		? arch_freq_get_on_cpu(policy->cpu)
> +		: 0;
> +
> +	if (freq > 0)
>   		ret = sysfs_emit(buf, "%u\n", freq);
>   	else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
>   		ret = sysfs_emit(buf, "%u\n", cpufreq_driver->get(policy->cpu));
> @@ -802,6 +810,19 @@ static ssize_t show_cpuinfo_cur_freq(struct cpufreq_policy *policy,
>   	return sysfs_emit(buf, "<unknown>\n");
>   }
>   
> +/*
> + * show_cpuinfo_avg_freq - average CPU frequency as detected by hardware
> + */
> +static ssize_t show_cpuinfo_avg_freq(struct cpufreq_policy *policy,
> +				     char *buf)
> +{
> +	int avg_freq = arch_freq_get_on_cpu(policy->cpu);

We are seeing issues when reading cpuinfo_avg_freq on an ARM64 system:

$ cat /sys/devices/system/cpu/cpufreq/policy1/cpuinfo_avg_freq
cat: /sys/devices/system/cpu/cpufreq/policy1/cpuinfo_avg_freq: Resource 
temporarily unavailable

The CPU is in idle state, so arch_freq_get_on_cpu() can't find a good 
alternative source for frequency info.

One way to resolve this is to have fallback methods in show_cpuinfo_avg_freq() 
so it will look like this:

static ssize_t show_cpuinfo_avg_freq(struct cpufreq_policy *policy,
                                      char *buf)
{
         int avg_freq = arch_freq_get_on_cpu(policy->cpu);
         int ret;

         if (avg_freq > 0)
                 ret = sysfs_emit(buf, "%u\n", avg_freq);
         else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
                 ret = sysfs_emit(buf, "%u\n", cpufreq_driver->get(policy->cpu));
         else
                 ret = sysfs_emit(buf, "%u\n", policy->cur);
         return ret;
}

But that also makes show_cpuinfo_avg_freq() pretty much the same as 
show_scaling_cur_freq().

So is it possible to consolidate show_cpuinfo_avg_freq() into 
show_scaling_cur_freq(), by making CONFIG_CPUFREQ_ARCH_CUR_FREQ also available 
to ARM64?

Kai-Heng

> +
> +	if (avg_freq > 0)
> +		return sysfs_emit(buf, "%u\n", avg_freq);
> +	return avg_freq != 0 ? avg_freq : -EINVAL;
> +}
> +
>   /*
>    * show_scaling_governor - show the current policy for the specified CPU
>    */
> @@ -964,6 +985,7 @@ static ssize_t show_bios_limit(struct cpufreq_policy *policy, char *buf)
>   }
>   
>   cpufreq_freq_attr_ro_perm(cpuinfo_cur_freq, 0400);
> +cpufreq_freq_attr_ro(cpuinfo_avg_freq);
>   cpufreq_freq_attr_ro(cpuinfo_min_freq);
>   cpufreq_freq_attr_ro(cpuinfo_max_freq);
>   cpufreq_freq_attr_ro(cpuinfo_transition_latency);
> @@ -1091,6 +1113,12 @@ static int cpufreq_add_dev_interface(struct cpufreq_policy *policy)
>   			return ret;
>   	}
>   
> +	if (cpufreq_avg_freq_supported(policy)) {
> +		ret = sysfs_create_file(&policy->kobj, &cpuinfo_avg_freq.attr);
> +		if (ret)
> +			return ret;
> +	}
> +
>   	ret = sysfs_create_file(&policy->kobj, &scaling_cur_freq.attr);
>   	if (ret)
>   		return ret;
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index d4d2f4d1d7cb..a7b6c0ccf9bc 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -1194,7 +1194,7 @@ static inline int of_perf_domain_get_sharing_cpumask(int pcpu, const char *list_
>   }
>   #endif
>   
> -extern unsigned int arch_freq_get_on_cpu(int cpu);
> +extern int arch_freq_get_on_cpu(int cpu);
>   
>   #ifndef arch_set_freq_scale
>   static __always_inline




More information about the linux-arm-kernel mailing list