[PATCH v3 00/12] sched: consolidation of cpu_power

Mon Jun 30 09:05:31 PDT 2014

Part of this patchset was previously part of the larger tasks packing patchset
[1]. I have split the latter into (at least) 3 separate patchsets to make
things easier:
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm

SMT systems are no longer the only systems that can have CPUs with an original
capacity that differs from the default value. We need to extend the use of
cpu_power_orig to all kinds of platforms so that the scheduler has both the
maximum capacity (cpu_power_orig/power_orig) and the current capacity
(cpu_power/power) of CPUs and sched_groups. A new function, arch_scale_cpu_power,
has been created and replaces arch_scale_smt_power, which is SMT specific, in
the computation of the capacity of a CPU.
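
As an illustration only (this is not code from the patchset), the relation
between the two values can be sketched as follows. The struct, the function
name update_capacity and the avail parameter are hypothetical; the sketch
assumes the current capacity is the original capacity scaled by the fraction
of CPU time left for CFS tasks, expressed in the range [0..SCHED_POWER_SCALE].

```c
#define SCHED_POWER_SCALE 1024UL /* default capacity of one CPU */

/* Hypothetical per-CPU record, used here for illustration only. */
struct cpu_capacity {
	unsigned long power_orig; /* maximum capacity reported by the arch */
	unsigned long power;      /* capacity currently available to CFS tasks */
};

/*
 * Simplified model: arch_power stands for the value an arch hook would
 * return, avail for the share of time not consumed by other activity
 * (0 = nothing left, SCHED_POWER_SCALE = everything left).
 */
static void update_capacity(struct cpu_capacity *c,
			    unsigned long arch_power, unsigned long avail)
{
	c->power_orig = arch_power;
	c->power = arch_power * avail / SCHED_POWER_SCALE;

	/* Never report a zero capacity, so later divisions stay safe. */
	if (!c->power)
		c->power = 1;
}
```

With arch_power = 512 and avail = 512, this model gives power_orig = 512 and
power = 256: the CPU is a small one and half of it is already taken.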

During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fixed load of
SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_POWER_SCALE.
This assumption leads to wrong decisions by creating ghost cores and by
removing real ones when the original capacity of CPUs is different from the
default SCHED_POWER_SCALE. We no longer try to evaluate the number of
available cores based on the group_capacity; instead, we detect when the group
is fully utilized.
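
The rounding problem can be shown with a small arithmetic sketch. It is a
simplified model, not the scheduler code: it keeps only the "divide the group
power by SCHED_POWER_SCALE and round to nearest" step and leaves out any other
clamping. The function names and the per-CPU values (640 and 1536) are
hypothetical.

```c
#define SCHED_POWER_SCALE 1024UL /* default capacity of one CPU */

/* Round-to-nearest division for unsigned operands. */
static unsigned long div_round_closest(unsigned long x, unsigned long d)
{
	return (x + d / 2) / d;
}

/*
 * Simplified model of the old estimate: how many "cores" a group of
 * nr_cpus CPUs appears to have when each CPU has power_per_cpu.
 */
static unsigned long old_core_estimate(unsigned int nr_cpus,
				       unsigned long power_per_cpu)
{
	return div_round_closest(nr_cpus * power_per_cpu, SCHED_POWER_SCALE);
}
```

In this model, old_core_estimate(3, 640) returns 2, so one of the 3 real CPUs
disappears, while old_core_estimate(2, 1536) returns 3, so a ghost core shows
up. Only CPUs at the default value, e.g. old_core_estimate(4, 1024) == 4, are
counted correctly.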

Now that we have the original capacity of CPUs and their activity/utilization,
we can evaluate more accurately the capacity and the level of utilization of a
group of CPUs.
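
A minimal sketch of such a check, under stated assumptions: the struct, the
function name and the margin_pct parameter are hypothetical and are not taken
from the patchset. The sketch assumes that capacity and utilization are
expressed in the same unit, and that a group counts as fully utilized once its
utilization, inflated by a margin given in percent, reaches its capacity.

```c
/* Hypothetical per-group statistics, used here for illustration only. */
struct group_stats {
	unsigned long capacity;    /* sum of the capacity of the group's CPUs */
	unsigned long utilization; /* sum of the utilization of those CPUs */
};

/*
 * Return 1 when the group has no spare capacity left.
 * margin_pct = 100 means no margin; 125 means the group is treated as
 * full once utilization reaches 80% of its capacity.
 */
static int group_is_fully_utilized(const struct group_stats *g,
				   unsigned int margin_pct)
{
	return g->utilization * margin_pct >= g->capacity * 100;
}
```

For a group with capacity 2048 and a margin of 125, a utilization of 1000 still
leaves room, whereas a utilization of 1800 marks the group as fully utilized.
No assumption about a fixed per-task load is needed for this comparison.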

This patchset mainly replaces the old capacity method with a new one and keeps
the policy almost unchanged, although we could certainly take advantage of this
new statistic in several other places of the load balance.

Test results:
Below are the results of 3 kinds of tests:
- hackbench -l 500 -s 4096
- scp of 100MB file on the platform
- ebizzy with various number of threads
on 3 kernels:

tip = tip/sched/core
patch = tip + this patchset
patch+irq = tip + this patchset + irq accounting

Each test has been run 6 times and the figures below show the stdev and the
diff compared to the tip kernel.

Dual cortex A7     tip   | patch             |  patch+irq
                   stdev | diff       stdev  |  diff       stdev
hackbench     (+/-)1.02% | +0.42%(+/-)1.29%  |  -0.65%(+/-)0.44%
scp           (+/-)0.41% | +0.18%(+/-)0.10%  | +78.05%(+/-)0.70%
ebizzy -t 1   (+/-)1.72% | +1.43%(+/-)1.62%  |  +2.58%(+/-)2.11%
ebizzy -t 2   (+/-)0.42% | +0.06%(+/-)0.45%  |  +1.45%(+/-)4.05%
ebizzy -t 4   (+/-)0.73% | +8.39%(+/-)13.25% |  +4.25%(+/-)10.08%
ebizzy -t 6  (+/-)10.30% | +2.19%(+/-)3.59%  |  +0.58%(+/-)1.80%
ebizzy -t 8   (+/-)1.45% | -0.05%(+/-)2.18%  |  +2.53%(+/-)3.40%
ebizzy -t 10  (+/-)3.78% | -2.71%(+/-)2.79%  |  -3.16%(+/-)3.06%
ebizzy -t 12  (+/-)3.21% | +1.13%(+/-)2.02%  |  -1.13%(+/-)4.43%
ebizzy -t 14  (+/-)2.05% | +0.15%(+/-)3.47%  |  -2.08%(+/-)1.40%

Quad cortex A15    tip   | patch             |  patch+irq
                   stdev | diff       stdev  |  diff       stdev
hackbench     (+/-)0.55% | -0.58%(+/-)0.90%  |  +0.62%(+/-)0.43%
scp           (+/-)0.21% | -0.10%(+/-)0.10%  |  +5.70%(+/-)0.53%
ebizzy -t 1   (+/-)0.42% | -0.58%(+/-)0.48%  |  -0.29%(+/-)0.18%
ebizzy -t 2   (+/-)0.52% | -0.83%(+/-)0.20%  |  -2.07%(+/-)0.35%
ebizzy -t 4   (+/-)0.22% | -1.39%(+/-)0.49%  |  -1.78%(+/-)0.67%
ebizzy -t 6   (+/-)0.44% | -0.78%(+/-)0.15%  |  -1.79%(+/-)1.10%
ebizzy -t 8   (+/-)0.43% | +0.13%(+/-)0.92%  |  -0.17%(+/-)0.67%
ebizzy -t 10  (+/-)0.71% | +0.10%(+/-)0.93%  |  -0.36%(+/-)0.77%
ebizzy -t 12  (+/-)0.65% | -1.07%(+/-)1.13%  |  -1.13%(+/-)0.70%
ebizzy -t 14  (+/-)0.92% | -0.28%(+/-)1.25%  |  +2.84%(+/-)9.33%

I haven't been able to fully test the patchset on an SMT system to check that
the regression reported by Preethi has been solved, but the various tests that
I have done don't show any regression so far.
The correction of SD_PREFER_SIBLING mode and its use at SMT level should have
fixed the regression.

Changes since V2:
 - rebase on top of capacity renaming
 - fix wake_affine statistic update
 - rework nohz_kick_needed
 - optimize the active migration of a task from CPU with reduced capacity
 - rename group_activity to group_utilization and remove unused total_utilization
 - repair SD_PREFER_SIBLING and use it for SMT level
 - reorder patchset to gather patches with same topics

Changes since V1:
 - add 3 fixes
 - correct some commit messages
 - replace capacity computation with activity
 - take into account current cpu capacity

[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377

Vincent Guittot (12):
  sched: fix imbalance flag reset
  sched: remove a wake_affine condition
  sched: fix avg_load computation
  sched: Allow all archs to set the power_orig
  ARM: topology: use new cpu_power interface
  sched: add per rq cpu_power_orig
  sched: test the cpu's capacity in wake affine
  sched: move cfs task on a CPU with higher capacity
  Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
  sched: get CPU's utilization statistic
  sched: replace capacity_factor by utilization
  sched: add SD_PREFER_SIBLING for SMT level

 arch/arm/kernel/topology.c |   4 +-
 kernel/sched/core.c        |   3 +-
 kernel/sched/fair.c        | 290 +++++++++++++++++++++++----------------------
 kernel/sched/sched.h       |   5 +-
 4 files changed, 158 insertions(+), 144 deletions(-)

-- 
1.9.1