[RFC] Energy/power monitoring within the kernel

Tue Oct 23 14:49:58 EDT 2012

On 10/23/12 19:30, the mail apparently from Pawel Moll included:
> Greetings All,
>
> More and more of people are getting interested in the subject of power
> (energy) consumption monitoring. We have some external tools like
> "battery simulators", energy probes etc., but some targets can measure
> their power usage on their own.
>
> Traditionally such data should be exposed to the user via hwmon sysfs
> interface, and that's exactly what I did for "my" platform - I have
> a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> enough to draw pretty graphs in userspace. Everyone was happy...
>
> Now I am getting new requests to do more with this data. In particular
> I'm asked how to add such information to ftrace/perf output. The second
> most frequent request is about providing it to a "energy aware"
> cpufreq governor.
>
> I've came up with three (non-mutually exclusive) options. I will
> appreciate any other ideas and comments (including "it makes not sense
> whatsoever" ones, with justification). Of course I am more than willing
> to spend time on prototyping anything that seems reasonable and propose
> patches.
>
>
>
> === Option 1: Trace event ===
>
> This seems to be the "cheapest" option. Simply defining a trace event
> that can be generated by a hwmon (or any other) driver makes the
> interesting data immediately available to any ftrace/perf user. Of
> course it doesn't really help with the cpufreq case, but seems to be
> a good place to start with.
>
> The question is how to define it... I've came up with two prototypes:
>
> = Generic hwmon trace event =
>
> This one allows any driver to generate a trace event whenever any
> "hwmon attribute" (measured value) gets updated. The rate at which the
> updates happen can be controlled by already existing "update_interval"
> attribute.
>
> 8<-------------------------------------------
> TRACE_EVENT(hwmon_attr_update,
> 	TP_PROTO(struct device *dev, struct attribute *attr, long long input),
> 	TP_ARGS(dev, attr, input),
>
> 	TP_STRUCT__entry(
> 		__string(       dev,		dev_name(dev))
> 		__string(	attr,		attr->name)
> 		__field(	long long,	input)
> 	),
>
> 	TP_fast_assign(
> 		__assign_str(dev, dev_name(dev));
> 		__assign_str(attr, attr->name);
> 		__entry->input = input;
> 	),
>
> 	TP_printk("%s %s %lld", __get_str(dev), __get_str(attr), __entry->input)
> );
> 8<-------------------------------------------
>
> It generates such ftrace message:
>
> <...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
>
> One issue with this is that some external knowledge is required to
> relate a number to a processor core. Or maybe it's not an issue at all
> because it should be left for the user(space)?
>
> = CPU power/energy/temperature trace event =
>
> This one is designed to emphasize the relation between the measured
> value (whether it is energy, temperature or any other physical
> phenomena, really) and CPUs, so it is quite specific (too specific?)
>
> 8<-------------------------------------------
> TRACE_EVENT(cpus_environment,
> 	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
> 	TP_ARGS(cpus, value, unit),
>
> 	TP_STRUCT__entry(
> 		__array(	unsigned char,	cpus,	sizeof(struct cpumask))
> 		__field(	long long,	value)
> 		__field(	char,		unit)
> 	),
>
> 	TP_fast_assign(
> 		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));
> 		__entry->value = value;
> 		__entry->unit = unit;
> 	),
>
> 	TP_printk("cpus %s %lld[%c]",
> 		__print_cpumask((struct cpumask *)__entry->cpus),
> 		__entry->value, __entry->unit)
> );
> 8<-------------------------------------------
>
> And the equivalent ftrace message is:
>
> <...>127.063107: cpus_environment: cpus 0,1,2,3 34361[C]
>
> It's a cpumask, not just single cpu id, because the sensor may measure
> the value per set of CPUs, eg. a temperature of the whole silicon die
> (so all the cores) or an energy consumed by a subset of cores (this
> is my particular use case - two meters monitor a cluster of two
> processors and a cluster of three processors, all working as a SMP
> system).
>
> Of course the cpus __array could be actually a special __cpumask field
> type (I've just hacked the __print_cpumask so far). And I've just
> realised that the unit field should actually be a string to allow unit
> prefixes to be specified (the above should obviously be "34361[mC]"
> not "[C]"). Also - excuse the "cpus_environment" name - this was the
> best I was able to come up with at the time and I'm eager to accept
> any alternative suggestions :-)

A thought on that... from an SoC perspective there are other interesting 
power rails than go to just the CPU core.  For example DDR power and 
rails involved with other IP units on the SoC such as 3D graphics unit. 
  So tying one number to specifically a CPU core does not sound like 
it's enough.

> === Option 2: hwmon perf PMU ===
>
> Although the trace event makes it possible to obtain interesting
> information using perf, the user wouldn't be able to treat the
> energy meter as a normal data source. In particular there would
> be no way of creating a group of events consisting eg. of a
> "normal" leader (eg. cache miss event) triggering energy meter
> read. The only way to get this done is to implement a perf PMU
> backend providing "environmental data" to the user.

In terms of like perf top don't think it'll be possible to know when to 
sample the acquisition hardware to tie the result to a particular line 
of code, even if it had the bandwidth to do that.  Power readings are 
likely to lag activities on the cpu somewhat, considering sub-ns core 
clocks, especially if it's actually measuring the input side of a regulator.

> = High-level hwmon API and PMU =
>
> Current hwmon subsystem does not provide any abstraction for the
> measured values and requires particular drivers to create specified
> sysfs attributes than used by userspace libsensors. This makes
> the framework ultimately flexible and ultimately hard to access
> from within the kernel...
>
> What could be done here is some (simple) API to register the
> measured values with the hwmon core which would result in creating
> equivalent sysfs attributes automagically, but also allow a
> in-kernel API for values enumeration and access. That way the core
> could also register a "hwmon PMU" with the perf framework providing
> data from all "compliant" drivers.
>
> = A driver-specific PMU =
>
> Of course a particular driver could register its own perf PMU on its
> own. It's certainly an option, just very suboptimal in my opinion.
> Or maybe not? Maybe the task is so specialized that it makes sense?
>
>
>
> === Option 3: CPU power(energy) monitoring framework ===
>
> And last but not least, maybe the problem deserves some dedicated
> API? Something that would take providers and feed their data into
> interested parties, in particular a perf PMU implementation and
> cpufreq governors?
>
> Maybe it could be an extension to the thermal framework? It already
> gives some meaning to a physical phenomena. Adding other, related ones
> like energy, and relating it to cpu cores could make some sense.

If you turn the problem upside down to solve the representation question 
first, maybe there's a way forward defining the "power tree" in terms of 
regulators, and then adding something in struct regulator that spams 
readers with timestamped results if the regulator has a power monitoring 
capability.

Then you can map the regulators in the power tree to real devices by the 
names or the supply stuff.  Just a thought.

-Andy

-- 
Andy Green | TI Landing Team Leader
Linaro.org │ Open source software for ARM SoCs | Follow Linaro
http://facebook.com/pages/Linaro/155974581091106  - 
http://twitter.com/#!/linaroorg - http://linaro.org/linaro-blog