[RFC 3/6] sched: pack small tasks
Vincent Guittot
vincent.guittot at linaro.org
Tue Nov 20 11:59:46 EST 2012
On 20 November 2012 15:28, Morten Rasmussen <Morten.Rasmussen at arm.com> wrote:
> Hi Vincent,
>
> On Mon, Nov 12, 2012 at 01:51:00PM +0000, Vincent Guittot wrote:
>> On 9 November 2012 18:13, Morten Rasmussen <Morten.Rasmussen at arm.com> wrote:
>> > Hi Vincent,
>> >
>> > I have experienced suboptimal buddy selection on a dual cluster setup
>> > (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
>> > CPU level. This seems to be the correct flag settings for a system with
>> > only cluster level power gating.
>> >
>> > To me it looks like update_packing_domain() is not doing the right
>> > thing. See inline comments below.
>>
>> Hi Morten,
>>
>> Thanks for testing the patches.
>>
>> It seems that I have too optimized the loop and remove some use cases.
>>
>> >
>> > On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
>> >> During sched_domain creation, we define a pack buddy CPU if available.
>> >>
>> >> On a system that share the powerline at all level, the buddy is set to -1
>> >>
>> >> On a dual clusters / dual cores system which can powergate each core and
>> >> cluster independantly, the buddy configuration will be :
>> >> | CPU0 | CPU1 | CPU2 | CPU3 |
>> >> -----------------------------------
>> >> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>> >>
>> >> Small tasks tend to slip out of the periodic load balance.
>> >> The best place to choose to migrate them is at their wake up.
>> >>
>> >> Signed-off-by: Vincent Guittot <vincent.guittot at linaro.org>
>> >> ---
>> >> kernel/sched/core.c | 1 +
>> >> kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> kernel/sched/sched.h | 1 +
>> >> 3 files changed, 111 insertions(+)
>> >>
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index dab7908..70cadbe 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>> >> rcu_assign_pointer(rq->sd, sd);
>> >> destroy_sched_domains(tmp, cpu);
>> >>
>> >> + update_packing_domain(cpu);
>> >> update_top_cache_domain(cpu);
>> >> }
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 4f4a4f6..8c9d3ed 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -157,6 +157,63 @@ void sched_init_granularity(void)
>> >> update_sysctl();
>> >> }
>> >>
>> >> +
>> >> +/*
>> >> + * Save the id of the optimal CPU that should be used to pack small tasks
>> >> + * The value -1 is used when no buddy has been found
>> >> + */
>> >> +DEFINE_PER_CPU(int, sd_pack_buddy);
>> >> +
>> >> +/* Look for the best buddy CPU that can be used to pack small tasks
>> >> + * We make the assumption that it doesn't wort to pack on CPU that share the
>> >> + * same powerline. We looks for the 1st sched_domain without the
>> >> + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest
>> >> + * power per core based on the assumption that their power efficiency is
>> >> + * better */
>> >> +void update_packing_domain(int cpu)
>> >> +{
>> >> + struct sched_domain *sd;
>> >> + int id = -1;
>> >> +
>> >> + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
>> >> + if (!sd)
>> >> + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
>> >> + else
>> >> + sd = sd->parent;
>> > sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
>> > groups of the parent level would represent the power domains. If get it
>> > right, we want to pack inside the cluster first and only let first cpu
>>
>> You probably wanted to use sched_group instead of cluster because
>> cluster is only a special use case, didn't you ?
>>
>> > of the cluster do packing on another cluster. So all cpus - except the
>> > first one - in the current sched domain should find its buddy within the
>> > domain and only the first one should go to the parent sched domain to
>> > find its buddy.
>>
>> We don't want to pack in the current sched_domain because it shares
>> power domain. We want to pack at the parent level
>>
>
> Yes. I think we mean the same thing. The packing takes place at the
> parent sched_domain but the sched_group that we are looking at only
> contains the cpus of the level below.
>
>> >
>> > I propose the following fix:
>> >
>> > - sd = sd->parent;
>> > + if (cpumask_first(sched_domain_span(sd)) == cpu
>> > + || !sd->parent)
>> > + sd = sd->parent;
>>
>> We always look for the buddy in the parent level whatever the cpu
>> position in the mask is.
>>
>> >
>> >
>> >> +
>> >> + while (sd) {
>> >> + struct sched_group *sg = sd->groups;
>> >> + struct sched_group *pack = sg;
>> >> + struct sched_group *tmp = sg->next;
>> >> +
>> >> + /* 1st CPU of the sched domain is a good candidate */
>> >> + if (id == -1)
>> >> + id = cpumask_first(sched_domain_span(sd));
>> >
>> > There is no guarantee that id is in the sched group pointed to by
>> > sd->groups, which is implicitly assumed later in the search loop. We
>> > need to find the sched group that contains id and point sg to that
>> > instead. I haven't found an elegant way to find that group, but the fix
>> > below should at least give the right result.
>> >
>> > + /* Find sched group of candidate */
>> > + tmp = sd->groups;
>> > + do {
>> > + if (cpumask_test_cpu(id, sched_group_cpus(tmp)))
>> > + {
>> > + sg = tmp;
>> > + break;
>> > + }
>> > + } while (tmp = tmp->next, tmp != sd->groups);
>> > +
>> > + pack = sg;
>> > + tmp = sg->next;
>>
>>
>> I have a new loop which solves your issue and others. I will use it
>> for the next version
>>
>> + while (sd) {
>> + struct sched_group *sg = sd->groups;
>> + struct sched_group *pack = sg;
>> + struct sched_group *tmp;
>> +
>> + /* The 1st CPU of the local group is a good candidate */
>> + id = cpumask_first(sched_group_cpus(pack));
>
> You make the assumption that the first sched_group in the list always contains
> the current cpu. I think that is always the case, but I haven't verified
> it. Maybe a comment about this would help people to understand the code
> easier.
yes, the first sched_group contains the cpu. I will add a comment
>
>> +
>> + /* loop the sched groups to find the best one */
>> + for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
>> + if (tmp->sgp->power * pack->group_weight >
>> + pack->sgp->power * tmp->group_weight)
>> + continue;
>> +
>> + if ((tmp->sgp->power * pack->group_weight ==
>> + pack->sgp->power * tmp->group_weight)
>> + && (cpumask_first(sched_group_cpus(tmp)) >= id))
>> + continue;
>> +
>> + /* we have found a better group */
>> + pack = tmp;
>> +
>> + /* Take the 1st CPU of the new group */
>> + id = cpumask_first(sched_group_cpus(pack));
>> + }
>> +
>
> Works great on my setup.
>
>> + /* Look for another CPU than itself */
>> + if ((id != cpu)
>> + || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
>> + break;
>
> If I understand correctly the last part of this check should avoid
> selecting a buddy in a sched_group that is not load balanced with the
> current one. In that case, I think that this check (or a similar check)
> should be done before the loop as well. As it is, the first iteration of
> the loop will always search all the groups of the first domain where
> SD_SHARE_POWERLINE is disabled regardless of the state of
> SD_LOAD_BALANCE flag. So if they are both disabled at the same level
> packing will happen across groups that are not supposed to be
> load-balanced.
you're right, i'm going to fix it
Thanks
>
> Regards,
> Morten
>
>> +
>> + sd = sd->parent;
>> + }
>>
>> Regards,
>> Vincent
>>
>> >
>> > Regards,
>> > Morten
>> >
>> >> +
>> >> + /* loop the sched groups to find the best one */
>> >> + while (tmp != sg) {
>> >> + if (tmp->sgp->power * sg->group_weight <
>> >> + sg->sgp->power * tmp->group_weight)
>> >> + pack = tmp;
>> >> + tmp = tmp->next;
>> >> + }
>> >> +
>> >> + /* we have found a better group */
>> >> + if (pack != sg)
>> >> + id = cpumask_first(sched_group_cpus(pack));
>> >> +
>> >> + /* Look for another CPU than itself */
>> >> + if ((id != cpu)
>> >> + || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
>> >> + break;
>> >> +
>> >> + sd = sd->parent;
>> >> + }
>> >> +
>> >> + pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
>> >> + per_cpu(sd_pack_buddy, cpu) = id;
>> >> +}
>> >> +
>> >> #if BITS_PER_LONG == 32
>> >> # define WMULT_CONST (~0UL)
>> >> #else
>> >> @@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target)
>> >> return target;
>> >> }
>> >>
>> >> +static inline bool is_buddy_busy(int cpu)
>> >> +{
>> >> + struct rq *rq = cpu_rq(cpu);
>> >> +
>> >> + /*
>> >> + * A busy buddy is a CPU with a high load or a small load with a lot of
>> >> + * running tasks.
>> >> + */
>> >> + return ((rq->avg.usage_avg_sum << rq->nr_running) >
>> >> + rq->avg.runnable_avg_period);
>> >> +}
>> >> +
>> >> +static inline bool is_light_task(struct task_struct *p)
>> >> +{
>> >> + /* A light task runs less than 25% in average */
>> >> + return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);
>> >> +}
>> >> +
>> >> +static int check_pack_buddy(int cpu, struct task_struct *p)
>> >> +{
>> >> + int buddy = per_cpu(sd_pack_buddy, cpu);
>> >> +
>> >> + /* No pack buddy for this CPU */
>> >> + if (buddy == -1)
>> >> + return false;
>> >> +
>> >> + /*
>> >> + * If a task is waiting for running on the CPU which is its own buddy,
>> >> + * let the default behavior to look for a better CPU if available
>> >> + * The threshold has been set to 37.5%
>> >> + */
>> >> + if ((buddy == cpu)
>> >> + && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
>> >> + return false;
>> >> +
>> >> + /* buddy is not an allowed CPU */
>> >> + if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
>> >> + return false;
>> >> +
>> >> + /*
>> >> + * If the task is a small one and the buddy is not overloaded,
>> >> + * we use buddy cpu
>> >> + */
>> >> + if (!is_light_task(p) || is_buddy_busy(buddy))
>> >> + return false;
>> >> +
>> >> + return true;
>> >> +}
>> >> +
>> >> /*
>> >> * sched_balance_self: balance the current task (running on cpu) in domains
>> >> * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
>> >> @@ -3098,6 +3204,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>> >> if (p->nr_cpus_allowed == 1)
>> >> return prev_cpu;
>> >>
>> >> + if (check_pack_buddy(cpu, p))
>> >> + return per_cpu(sd_pack_buddy, cpu);
>> >> +
>> >> if (sd_flag & SD_BALANCE_WAKE) {
>> >> if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> >> want_affine = 1;
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> index a95d5c1..086d8bf 100644
>> >> --- a/kernel/sched/sched.h
>> >> +++ b/kernel/sched/sched.h
>> >> @@ -875,6 +875,7 @@ static inline void idle_balance(int cpu, struct rq *rq)
>> >>
>> >> extern void sysrq_sched_debug_show(void);
>> >> extern void sched_init_granularity(void);
>> >> +extern void update_packing_domain(int cpu);
>> >> extern void update_max_interval(void);
>> >> extern void update_group_power(struct sched_domain *sd, int cpu);
>> >> extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu);
>> >> --
>> >> 1.7.9.5
>> >>
>> >>
>> >> _______________________________________________
>> >> linaro-dev mailing list
>> >> linaro-dev at lists.linaro.org
>> >> http://lists.linaro.org/mailman/listinfo/linaro-dev
>> >>
>> >
>>
>
More information about the linux-arm-kernel
mailing list