[PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

Thu Dec 10 04:40:28 EST 2020

On Tue, 8 Dec 2020 at 17:14, Vincent Guittot <vincent.guittot at linaro.org> wrote:
>
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman at techsingularity.net> wrote:
> >
> > After select_idle_sibling, p->recent_used_cpu is set to the
> > new target. However on the next wakeup, prev will be the same as
> > recent_used_cpu unless the load balancer has moved the task since the last
> > wakeup. It still works, but is less efficient than it could be given the
> > changes merged since then that reduce unnecessary migrations, the load
> > balancer changes, etc. This patch preserves recent_used_cpu for longer.
> >
> > With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled:
> >
> >                           5.10.0-rc6             5.10.0-rc6
> >                          baseline-v2           altrecent-v2
> > Hmean     1        508.39 (   0.00%)      502.05 *  -1.25%*
> > Hmean     2        986.70 (   0.00%)      983.65 *  -0.31%*
> > Hmean     4       1914.55 (   0.00%)     1920.24 *   0.30%*
> > Hmean     8       3702.37 (   0.00%)     3663.96 *  -1.04%*
> > Hmean     16      6573.11 (   0.00%)     6545.58 *  -0.42%*
> > Hmean     32     10142.57 (   0.00%)    10253.73 *   1.10%*
> > Hmean     64     14348.40 (   0.00%)    12506.31 * -12.84%*
> > Hmean     128    21842.59 (   0.00%)    21967.13 *   0.57%*
> > Hmean     256    20813.75 (   0.00%)    21534.52 *   3.46%*
> > Hmean     320    20684.33 (   0.00%)    21070.14 *   1.87%*
> >
> > The difference was marginal except for 64 threads, where the baseline
> > result was very unstable whereas the patched result was much more
> > stable. This is somewhat machine specific, as on a separate 80-cpu
> > Broadwell machine the same test reported:
> >
> >                           5.10.0-rc6             5.10.0-rc6
> >                          baseline-v2           altrecent-v2
> > Hmean     1        310.36 (   0.00%)      291.81 *  -5.98%*
> > Hmean     2        340.86 (   0.00%)      547.22 *  60.54%*
> > Hmean     4        912.29 (   0.00%)     1063.21 *  16.54%*
> > Hmean     8       2116.40 (   0.00%)     2103.60 *  -0.60%*
> > Hmean     16      4232.90 (   0.00%)     4362.92 *   3.07%*
> > Hmean     32      8442.03 (   0.00%)     8642.10 *   2.37%*
> > Hmean     64     11733.91 (   0.00%)    11473.66 *  -2.22%*
> > Hmean     128    17727.24 (   0.00%)    16784.23 *  -5.32%*
> > Hmean     256    16089.23 (   0.00%)    16110.79 *   0.13%*
> > Hmean     320    15992.60 (   0.00%)    16071.64 *   0.49%*
> >
> > schedstats were not used in this series, but from an earlier debugging
> > effort, the schedstats after the test run were as follows:
> >
> > Ops SIS Search               5653107942.00  5726545742.00
> > Ops SIS Domain Search        3365067916.00  3319768543.00
> > Ops SIS Scanned            112173512543.00 99194352541.00
> > Ops SIS Domain Scanned     109885472517.00 96787575342.00
> > Ops SIS Failures             2923185114.00  2950166441.00
> > Ops SIS Recent Used Hit           56547.00   118064916.00
> > Ops SIS Recent Used Miss     1590899250.00   354942791.00
> > Ops SIS Recent Attempts      1590955797.00   473007707.00
> > Ops SIS Search Efficiency             5.04           5.77
> > Ops SIS Domain Search Eff             3.06           3.43
> > Ops SIS Fast Success Rate            40.47          42.03
> > Ops SIS Success Rate                 48.29          48.48
> > Ops SIS Recent Success Rate           0.00          24.96
> >
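The derived rows in the table above are not defined in the changelog, but
the numbers are consistent with simple ratios of the raw counters: "Search
Efficiency" matches Search / Scanned, "Domain Search Eff" matches Domain
Search / Domain Scanned, and "Recent Success Rate" matches Recent Used Hit /
Recent Attempts. A minimal userspace sketch of that arithmetic, assuming
those inferred formulas (the function name below is made up for
illustration and is not part of any kernel or schedstat interface):

```c
#include <assert.h>

/*
 * Return num as a percentage of den.
 * Hypothetical helper; the formulas it is used with are inferred from
 * the quoted numbers, not taken from the schedstat code.
 */
double sis_pct(double num, double den)
{
	assert(den > 0.0);	/* every counter in the table is non-zero */
	return 100.0 * num / den;
}

/*
 * Example with the patched (right-hand) column:
 *   sis_pct(118064916.0, 473007707.0)      "Recent Success Rate"
 *   sis_pct(5726545742.0, 99194352541.0)   "Search Efficiency"
 *   sis_pct(3319768543.0, 96787575342.0)   "Domain Search Eff"
 */
```

The counters are passed as doubles because several of them exceed the range
of a 32-bit integer.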
> > First interesting point is the ridiculous number of times runqueues are
> > scanned -- almost 97 billion times over the course of 40 minutes.
> >
> > With the patch, "Recent Used Hit" is over 2000 times more likely to
> > succeed. The failure rate also increases by quite a lot but the cost is
> > marginal even if the "Fast Success Rate" only increases by 2% overall. What
> > cannot be observed from these stats is where the biggest impact is, as
> > these stats cover low utilisation to over saturation.
> >
> > If graphed over time, the graphs show that the sched domain is only
> > scanned at negligible rates until the machine is fully busy. With
> > low utilisation, the "Fast Success Rate" is almost 100% until the
> > machine is fully busy. For 320 clients, the success rate is close to
> > 0% which is unsurprising.
> >
> > Signed-off-by: Mel Gorman <mgorman at techsingularity.net>
>
> Reviewed-by: Vincent Guittot <vincent.guittot at linaro.org>

This patch is responsible for a performance regression on my thx2 with
hackbench. So although I reviewed it, it should not be applied, as the
change in behavior is far deeper than expected.

>
> > ---
> >  kernel/sched/fair.c | 9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5c41875aec23..413d895bbbf8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >
> >         /* Check a recently used CPU as a potential idle candidate: */
> >         recent_used_cpu = p->recent_used_cpu;
> > +       p->recent_used_cpu = prev;
> >         if (recent_used_cpu != prev &&
> >             recent_used_cpu != target &&
> >             cpus_share_cache(recent_used_cpu, target) &&
> >             (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> >             cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> >             asym_fits_capacity(task_util, recent_used_cpu)) {
> > -               /*
> > -                * Replace recent_used_cpu with prev as it is a potential
> > -                * candidate for the next wake:
> > -                */
> > -               p->recent_used_cpu = prev;
> >                 return recent_used_cpu;
> >         }
> >
> > @@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >         } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> >                 /* Fast path */
> >                 new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > -
> > -               if (want_affine)
> > -                       current->recent_used_cpu = cpu;
> >         }
> >         rcu_read_unlock();
> >
> > --
> > 2.26.2
> >
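For reference, the select_idle_sibling() hunk can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions, not
kernel code: the candidate check is reduced to "differs from prev and
target, and is idle" (cache sharing, affinity and capacity checks are left
out), the select_task_rq_fair() hunk is not modelled, and NR_CPUS_MODEL,
struct task_model and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8	/* hypothetical machine size for the model */

/* Hypothetical stand-in for the one task_struct field of interest. */
struct task_model {
	int recent_used_cpu;
};

/* Simplified candidate test; the real check has more conditions. */
bool recent_cpu_usable(int recent, int prev, int target,
		       const bool idle[NR_CPUS_MODEL])
{
	assert(recent >= 0 && recent < NR_CPUS_MODEL);
	return recent != prev && recent != target && idle[recent];
}

/* Before the patch: the record changes only when the shortcut succeeds. */
int pick_recent_old(struct task_model *p, int prev, int target,
		    const bool idle[NR_CPUS_MODEL])
{
	int recent = p->recent_used_cpu;

	if (recent_cpu_usable(recent, prev, target, idle)) {
		p->recent_used_cpu = prev;
		return recent;
	}
	return -1;	/* shortcut not taken */
}

/* After the patch: the record is set to prev on every pass. */
int pick_recent_new(struct task_model *p, int prev, int target,
		    const bool idle[NR_CPUS_MODEL])
{
	int recent = p->recent_used_cpu;

	p->recent_used_cpu = prev;
	if (recent_cpu_usable(recent, prev, target, idle))
		return recent;
	return -1;	/* shortcut not taken */
}
```

In this model, a wakeup whose recent CPU is busy leaves the record
untouched with the old logic, while the new logic overwrites it with prev
regardless of the outcome. That is the behavioural difference the hunk
introduces, independent of what the removed lines in select_task_rq_fair()
did.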