[RFC v2 PATCH 2/2] sched: Use Per-Entity-Load-Tracking metric for load balancing

Preeti U Murthy preeti at linux.vnet.ibm.com
Fri Nov 16 05:15:47 EST 2012


Hi Vincent,
Thank you for your review.

On 11/15/2012 11:43 PM, Vincent Guittot wrote:
> Hi Preeti,
> 
> On 15 November 2012 17:54, Preeti U Murthy <preeti at linux.vnet.ibm.com> wrote:
>> Currently the load balancer weighs a task based upon its priority, and this
>> weight consequently gets added to the weight of the run queue that it is
>> on. It is this weight of the run queue that sums up to a sched group's
>> load, which is used to decide the busiest or the idlest group and the run
>> queue thereof.
>>
>> The Per-Entity-Load-Tracking metric, however, measures how long a task has
>> been runnable over the duration of its lifetime. This gives us a hint of
>> the amount of CPU time that the task can demand, and the metric takes the
>> task's priority into account as well. Therefore, apart from the priority
>> of a task, we also have an idea of its live behavior. This seems to be a
>> more realistic metric to use when computing the task weight that adds up
>> to the run queue weight and the weight of the sched group. Consequently,
>> they can be used for load balancing.
>>
>> The semantics of load balancing are left untouched. The two functions
>> load_balance() and select_task_rq_fair() perform the task of load
>> balancing. These two paths have been gone through in this patch to make
>> the necessary changes.
>>
>> weighted_cpuload() and task_h_load() provide the run queue weight and the
>> weight of the task respectively. They have been modified to provide the
>> Per-Entity-Load-Tracking metric as relevant for each.
>> The rest of the modifications had to be made to suit these two changes.
>>
>> The Completely Fair Scheduler class is the only sched_class which
>> contributes to the run queue load. Therefore rq->load.weight ==
>> cfs_rq->load.weight when the cfs_rq is the root cfs_rq (rq->cfs) of the
>> hierarchy. When replacing this with the Per-Entity-Load-Tracking metric,
>> cfs_rq->runnable_load_avg needs to be used, as that is the right
>> reflection of the run queue load in this case. This metric reflects the
>> percentage uptime of the tasks queued on the run queue, and hence their
>> contribution to the load. Thus cfs_rq->runnable_load_avg replaces the
>> metric earlier used in weighted_cpuload().
>>
>> The task load is aptly captured by se.avg.load_avg_contrib, which reflects
>> the runnable time vs the alive time of the task against its priority. This
>> metric replaces the one earlier used in task_h_load().
>>
>> The consequent changes appear as data type changes for the helper
>> variables, and they abound in number, because cfs_rq->runnable_load_avg
>> needs to be big enough to capture the tasks' load often and accurately.
> 
> You are now using cfs_rq->runnable_load_avg instead of
> cfs_rq->load.weight for the calculation of cpu_load, but
> cfs_rq->runnable_load_avg is smaller than or equal to the
> cfs_rq->load.weight value. This implies that the new value is smaller
> than or equal to the old statistic, so you should be able to keep the
> same variable width for the computation of cpu_load.

Right. But cfs_rq->runnable_load_avg is a 64-bit unsigned integer as per
the Per-Entity-Load-Tracking patchset. I could not figure out why this is
the case, although, as you mention, its value will not exceed
cfs_rq->load.weight. In order to retain the data type of
cfs_rq->runnable_load_avg as it is, these changes had to be made to suit
it. It would be good if someone could clarify why it is a 64-bit integer;
it would save a lot of trouble if we could consider it the same width as
cfs_rq->load.weight. Ben, Paul, can you clarify this point?
> 
>>
>> The following patch does not consider CONFIG_FAIR_GROUP_SCHED and
>> CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting
>> from the simplest scenario. Earlier discussions can be found in the link
>> below.
>>
>> Link: https://lkml.org/lkml/2012/10/25/162
>> Signed-off-by: Preeti U Murthy <preeti at linux.vnet.ibm.com>
>> ---
>>  include/linux/sched.h |    2 +-
>>  kernel/sched/core.c   |   12 +++++----
>>  kernel/sched/fair.c   |   64 +++++++++++++++++++++++++------------------------
>>  kernel/sched/sched.h  |    2 +-
>>  4 files changed, 40 insertions(+), 40 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 087dd20..302756e 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -924,7 +924,7 @@ struct sched_domain {
>>         unsigned int lb_count[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
>> -       unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
>> +       u64 lb_imbalance[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 24d8b9b..4dea057 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2415,8 +2415,8 @@ static const unsigned char
>>   * would be when CPU is idle and so we just decay the old load without
>>   * adding any new load.
>>   */
>> -static unsigned long
>> -decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
>> +static u64
>> +decay_load_missed(u64 load, unsigned long missed_updates, int idx)
>>  {
>>         int j = 0;
>>
>> @@ -2444,7 +2444,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
>>   * scheduler tick (TICK_NSEC). With tickless idle this will not be called
>>   * every tick. We fix it up based on jiffies.
>>   */
>> -static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>> +static void __update_cpu_load(struct rq *this_rq, u64 this_load,
>>                               unsigned long pending_updates)
>>  {
>>         int i, scale;
>> @@ -2454,7 +2454,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>         /* Update our load: */
>>         this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
>>         for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
>> -               unsigned long old_load, new_load;
>> +               u64 old_load, new_load;
>>
>>                 /* scale is effectively 1 << i now, and >> i divides by scale */
>>
>> @@ -2496,7 +2496,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>  void update_idle_cpu_load(struct rq *this_rq)
>>  {
>>         unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>> -       unsigned long load = this_rq->load.weight;
>> +       u64 load = this_rq->cfs.runnable_load_avg;
>>         unsigned long pending_updates;
>>
>>         /*
>> @@ -2546,7 +2546,7 @@ static void update_cpu_load_active(struct rq *this_rq)
>>          * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>>          */
>>         this_rq->last_load_update_tick = jiffies;
>> -       __update_cpu_load(this_rq, this_rq->load.weight, 1);
>> +       __update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
>>
>>         calc_load_account_active(this_rq);
>>  }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a9cdc8f..f8f3a29 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2935,9 +2935,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>
>>  #ifdef CONFIG_SMP
>>  /* Used instead of source_load when we know the type == 0 */
>> -static unsigned long weighted_cpuload(const int cpu)
>> +static u64 weighted_cpuload(const int cpu)
>>  {
>> -       return cpu_rq(cpu)->load.weight;
>> +       return cpu_rq(cpu)->cfs.runnable_load_avg;
>>  }
>>
>>  /*
>> @@ -2947,10 +2947,10 @@ static unsigned long weighted_cpuload(const int cpu)
>>   * We want to under-estimate the load of migration sources, to
>>   * balance conservatively.
>>   */
>> -static unsigned long source_load(int cpu, int type)
>> +static u64 source_load(int cpu, int type)
>>  {
>>         struct rq *rq = cpu_rq(cpu);
>> -       unsigned long total = weighted_cpuload(cpu);
>> +       u64 total = weighted_cpuload(cpu);
>>
>>         if (type == 0 || !sched_feat(LB_BIAS))
>>                 return total;
>> @@ -2962,10 +2962,10 @@ static unsigned long source_load(int cpu, int type)
>>   * Return a high guess at the load of a migration-target cpu weighted
>>   * according to the scheduling class and "nice" value.
>>   */
>> -static unsigned long target_load(int cpu, int type)
>> +static u64 target_load(int cpu, int type)
>>  {
>>         struct rq *rq = cpu_rq(cpu);
>> -       unsigned long total = weighted_cpuload(cpu);
>> +       u64 total = weighted_cpuload(cpu);
>>
>>         if (type == 0 || !sched_feat(LB_BIAS))
>>                 return total;
>> @@ -2978,13 +2978,13 @@ static unsigned long power_of(int cpu)
>>         return cpu_rq(cpu)->cpu_power;
>>  }
>>
>> -static unsigned long cpu_avg_load_per_task(int cpu)
>> +static u64 cpu_avg_load_per_task(int cpu)
>>  {
>>         struct rq *rq = cpu_rq(cpu);
>>         unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>>
>>         if (nr_running)
>> -               return rq->load.weight / nr_running;
>> +               return rq->cfs.runnable_load_avg / nr_running;
> 
> You now need to use div_u64() for any division of a 64-bit variable like
> runnable_load_avg, otherwise it won't compile on 32-bit platforms like
> ARM. This one is obvious because it appears in your patch, but other
> divisions could now be 64-bit divisions as well.

Ah yes, there will be some trouble here. Explicit do_div()/div_u64() calls
need to be inserted, and there will be plenty of such cases. But as I
mentioned above, once we are clear about why the width of
cfs_rq->runnable_load_avg is 64 bit, we can sort this out. We will need
someone to clarify this.

I am at a loss to see a way around making the above changes if for some
reason the width of cfs_rq->runnable_load_avg has to be maintained as is.
Any thoughts on this?
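
To make it concrete, the obvious case above would turn into something like
the below if we keep the u64 width and use div_u64() from <linux/math64.h>
as you suggest (just an untested sketch):

	static u64 cpu_avg_load_per_task(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

		if (nr_running)
			/* 64-bit dividend, 32-bit divisor: fine on 32-bit too */
			return div_u64(rq->cfs.runnable_load_avg, nr_running);

		return 0;
	}

The same treatment would be needed wherever the new u64 load values end up
as a dividend; none of this is necessary if the width can be brought down
to unsigned long instead.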
> 
> Regards,
> Vincent

Regards
Preeti U Murthy



