[RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Tue Dec 15 04:22:38 PST 2015

On 14/12/15 16:59, Mark Brown wrote:
> On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
> > On 11/12/15 17:49, Mark Brown wrote:
> 
> > > The purpose of the capacity values is to influence the scheduler
> > > behaviour and hence performance.  Without a concrete definition they're
> > > just magic numbers which have meaning only in terms of their effect on
> > > the performance of the system.  That is a sufficiently complex outcome
> > > to ensure that there will be an element of taste in what the desired
> > > outcomes are.  Sounds like tuneables to me.
> 
> > Capacity values are meant to describe asymmetry (if any) of the system
> > CPUs to the scheduler. The scheduler can then use this additional bit of
> > information to try to make better scheduling decisions. Yes, having these
> > values available will end up giving you better performance, but I guess
> > this applies to any information we provide to the kernel (and scheduler);
> > the less dumb a subsystem is, the better we can make it work.
> 
> This information is a magic number, there's never going to be a right
> answer.  If it needs changing it's not like the kernel is modeling a
> concrete thing like the relative performance of the A53 and A57 poorly
> or whatever, it's just that the relative values of number A and number B
> are not what the system integrator desires.
> 
> > > If you are saying people should use other, more sensible, ways of
> > > specifying the final values that actually get used in production then
> > > why take the defaults from direct numbers in DT in the first place?  If you
> > > are saying that people should tune and then put the values in here then
> > > that's problematic for the reasons I outlined.
> 
> > IMHO, people should come up with default values that describe
> > heterogeneity in their system. Then use other ways to tune the system at
> > run time (depending on the workload maybe).
> 
> My argument is that they should be describing the heterogeneity of their
> system by describing concrete properties of their system rather than by
> providing magic numbers.
> 
> > As said, I understand your concerns; but what I still don't get is
> > how CPU capacity values are so different from, say, idle states
> > min-residency-us. AFAIK there is a per-SoC benchmarking phase required
> > to come up with those values as well; you have to pick some benchmark
> > that stresses worst case entry/exit while measuring energy, then make
> > calculations that tell you when it is wise to enter a particular idle
> > state. Ideally we should derive min residency from specs, but I'm not
> > sure that is how it works in practice.
> 
> Those at least have a concrete physical value that it is possible to
> measure in a describable way that is unlikely to change based on the
> internals of the kernel.  It would be kind of nice to have the broken
> down numbers for entry time, exit time and power burn in suspend but
> it's not clear it's worth the bother.  It's also one of these things
> where we don't have any real proxies that get us anywhere in the
> ballpark of where we want to be.
> 

I'm proposing to add a new value because I couldn't find any proxies in
the current bindings that bring us anywhere close to what we need. If I
failed in looking for them, and they actually exist, I'll personally be
more than happy to just rely on them instead of adding more stuff :-).

Interestingly, to me it sounds like we could actually use your first
paragraph above almost as it is to describe how to come up with capacity
values. In the documentation I put the following:

"One simple way to estimate CPU capacities is to iteratively run a
well-known CPU user space benchmark (e.g., sysbench, dhrystone, etc.) on
each CPU at maximum frequency and then normalize values w.r.t. the best
performing CPU."

I don't see why this should change if we decide that the scheduler has
to change in the future.
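
To make the normalization step in the quoted documentation text a bit
more concrete, here is a rough sketch of what I have in mind. The
function name, the CAPACITY_SCALE value of 1024 (chosen to mirror
SCHED_CAPACITY_SCALE) and the rounding are assumptions made for
illustration; none of this is part of the proposed bindings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed scale for illustration; mirrors SCHED_CAPACITY_SCALE (1024). */
#define CAPACITY_SCALE 1024u

/*
 * Hypothetical helper: normalize per-CPU benchmark scores (higher is
 * better, all measured at maximum frequency) w.r.t. the best performing
 * CPU, so that the best CPU gets exactly CAPACITY_SCALE.
 */
static void normalize_capacities(const uint64_t *score, uint32_t *capacity,
				 size_t ncpus)
{
	uint64_t best = 0;
	size_t i;

	assert(ncpus > 0);

	/* Find the best performing CPU. */
	for (i = 0; i < ncpus; i++)
		if (score[i] > best)
			best = score[i];
	assert(best > 0);

	/* Scale every score against the best one, rounding to nearest. */
	for (i = 0; i < ncpus; i++)
		capacity[i] = (uint32_t)((score[i] * CAPACITY_SCALE + best / 2) / best);
}
```

With made-up scores of 2000 for the big CPUs and 1000 for the little
ones, this would give 1024 and 512 respectively. Only the ratio between
scores matters, so the choice of benchmark affects the result but the
procedure itself does not depend on scheduler internals.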

Also, looking again at section 2 of idle-states bindings docs, we have a
nice and accurate description of what min-residency is, but not much
info about how we can actually measure that. Maybe, expanding the docs
section regarding CPU capacity could help?

> > > It also seems a bit strange to expect people to do some tuning in one
> > > place initially and then additional tuning somewhere else later, from
> > > a user point of view I'd expect to always do my tuning in the same
> > > place.
> 
> > I think that runtime tuning needs are much more complex and have finer
> > grained needs than what you can achieve by playing with CPU capacities.
> > And I agree with you, users should only play with these other methods
> > I'm referring to; they should not mess around with platform description
> > bits. They should provide information about runtime needs, then the
> > scheduler (in this case) will do its best to give them acceptable
> > performance using improved knowledge about the platform.
> 
> So then why isn't it adequate to just have things like the core types in
> there and work from there?  Are we really expecting the tuning to be so
> much better than it's possible to come up with something that's so much
> better on the scale that we're expecting this to be accurate that it's
> worth just jumping straight to magic numbers?
> 

I take your point here that having fine grained values might not really
give us appreciable differences (that is also why I proposed the
capacity-scale in the first instance), but I'm not sure I'm getting what
you are proposing here.

Today, and for arm only, we have a static table representing CPUs
"efficiency":

 /*
  * Table of relative efficiency of each processors
  * The efficiency value must fit in 20bit and the final
  * cpu_scale value must be in the range
  *   0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
  * in order to return at most 1 when DIV_ROUND_CLOSEST
  * is used to compute the capacity of a CPU.
  * Processors that are not defined in the table,
  * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
  */
 static const struct cpu_efficiency table_efficiency[] = {
 	{"arm,cortex-a15", 3891},
 	{"arm,cortex-a7",  2048},
 	{NULL, },
 };

When the clock-frequency property is defined in DT, we try to find a match
for the compatibility string in the table above and then use the
associated number to compute the capacity. Are you proposing to have
something like this for arm64 as well?
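
Just to spell out the lookup described above, a simplified user space
sketch follows. It is not the in-kernel code: the helper names are
made up, the 20 bit shift on the frequency is an assumption on my side,
and the final step that turns the raw product into cpu_scale is left
out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct cpu_efficiency {
	const char *compatible;
	unsigned long efficiency;
};

/* Same entries as the arm table quoted above. */
static const struct cpu_efficiency table_efficiency[] = {
	{"arm,cortex-a15", 3891},
	{"arm,cortex-a7",  2048},
	{NULL, 0},
};

/*
 * Hypothetical helper: return the efficiency for a compatible string,
 * or 0 if the core type is not in the table.
 */
static unsigned long lookup_efficiency(const char *compatible)
{
	const struct cpu_efficiency *e;

	assert(compatible != NULL);

	for (e = table_efficiency; e->compatible != NULL; e++)
		if (strcmp(e->compatible, compatible) == 0)
			return e->efficiency;

	return 0;
}

/*
 * Hypothetical helper: raw (not yet normalized) capacity, i.e. the
 * clock-frequency value scaled down by an assumed 20 bit shift and
 * multiplied by the efficiency. Returns 0 for unknown core types.
 */
static uint64_t raw_capacity(const char *compatible, uint32_t clock_hz)
{
	return (uint64_t)(clock_hz >> 20) * lookup_efficiency(compatible);
}
```

The point is that any core type missing from the table yields no usable
number, so every new core needs a new entry with a value relative to
the ones already there.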

BTW, the only info I could find about those numbers is from this thread

 http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html

Vincent, do we have more precise information about these numbers
somewhere else?

If I understand how that table was created, how do we think we will
extend it in the future to allow newer core types (say we replicate this
solution for arm64)?  It seems that we have to change it, rescaling
values, each time we have a new core on the market. How can we come up
with relative numbers, in the future, comparing newer cores to old ones
(that might be already out of the market by that time)?

> > > Doing that and then switching to some other interface for real tuning
> > > seems especially odd and I'm not sure that's something that users are
> > > going to expect or understand.
> 
> > As I'm saying above, users should not care about this first step of
> > platform description; not more than how much they care about other bits
> > in DTs that describe their platform.
> 
> That may be your intention but I don't see how it is realistic to expect
> that this is what people will actually understand.  It's a number, it
> has an effect and it's hard to see that people won't tune it, it's not
> like people don't have to edit DTs during system integration.  People
> won't reliably read documentation or look in mailing list threads and
> other than that it has all the properties of a tuning interface.
> 

Eh, sad but true. I guess we can, as we usually do, put more effort into
documenting how things are supposed to be used. Then, if people think
that they can make their system perform better without looking at
documentation or asking around, I'm not sure there is much we could do
to prevent them from doing things wrong. There are already a lot of things
people shouldn't touch if they don't know what they are doing. :-/

> There's a tension here between what you're saying about people not being
> supposed to care much about the numbers for tuning and the very fact
> that there's a need for the DT to carry explicit numbers.

My point is that people with tuning needs shouldn't even look at DTs,
but put all their efforts into describing (using appropriate APIs) their
needs and how they apply to the workload they care about. Our job is to
put together information coming from users and knowledge of system
configuration to give people the desired outcomes.

Best,

- Juri