[RFC PATCH v4 2/3] scheduler: add scheduler level for clusters
Song Bao Hua (Barry Song)
song.bao.hua at hisilicon.com
Mon Mar 8 22:15:10 GMT 2021
> -----Original Message-----
> From: Vincent Guittot [mailto:vincent.guittot at linaro.org]
> Sent: Tuesday, March 9, 2021 12:26 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua at hisilicon.com>
> Subject: Re: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters
>
> On Tue, 2 Mar 2021 at 00:08, Barry Song <song.bao.hua at hisilicon.com> wrote:
> >
> > The ARM64 Kunpeng 920 chip has 6 or 8 clusters in each NUMA node, and
> > each cluster has 4 CPUs. All clusters share the L3 cache data, but each
> > cluster has its own local L3 tag. In addition, CPUs within one cluster
> > share some internal system bus. This means the cache coherence overhead
> > inside one cluster is much less than the overhead across clusters.
> >
> > This patch adds a sched_domain for clusters. On Kunpeng 920, without
> > this patch, domain0 of cpu0 is MC, spanning cpu0~cpu23; with this
> > patch, MC becomes domain1, and a new domain0 "CLS" spans cpu0-cpu3.
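> >
> > With CONFIG_SCHED_DEBUG, the new hierarchy can be verified from
> > userspace; a minimal check (assuming the /proc/sys/kernel/sched_domain
> > interface exposed by kernels of this era):
> >
> > $ cat /proc/sys/kernel/sched_domain/cpu0/domain0/name
> > CLS
> > $ cat /proc/sys/kernel/sched_domain/cpu0/domain1/name
> > MC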
> >
> > This helps spread unrelated tasks among clusters, which decreases
> > contention and improves throughput; for example, the stream benchmark
> > improves by around 4.3%~6.3% with this patch:
> >
> > w/o patch:
> > $ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
> > STREAM copy latency: 3.36 nanoseconds
> > STREAM copy bandwidth: 57072.50 MB/sec
> > STREAM scale latency: 3.40 nanoseconds
> > STREAM scale bandwidth: 56542.52 MB/sec
> > STREAM add latency: 5.10 nanoseconds
> > STREAM add bandwidth: 56482.83 MB/sec
> > STREAM triad latency: 5.14 nanoseconds
> > STREAM triad bandwidth: 56069.52 MB/sec
> >
> > w/ patch:
> > $ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
> > STREAM copy latency: 3.22 nanoseconds
> > STREAM copy bandwidth: 59660.96 MB/sec -> +4.5%
> > STREAM scale latency: 3.25 nanoseconds
> > STREAM scale bandwidth: 59002.29 MB/sec -> +4.3%
> > STREAM add latency: 4.80 nanoseconds
> > STREAM add bandwidth: 60036.62 MB/sec -> +6.3%
> > STREAM triad latency: 4.86 nanoseconds
> > STREAM triad bandwidth: 59228.30 MB/sec -> +5.6%
> >
> > On the other hand, when doing WAKE_AFFINE, this patch tries to find
> > an idle core in the target cluster before scanning the whole llc
> > domain, so it helps gather related tasks within one cluster.
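> >
> > As a simplified sketch, the intended scan order is the following
> > (scan_for_idle() is a stand-in name for the patch's _select_idle_cpu()
> > helper, not a real kernel function):
> >
> > 	/* pass 1: only CPUs in the wakee's target cluster */
> > 	cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
> > 	i = scan_for_idle(cpus);
> > 	if (i >= 0)
> > 		return i;
> >
> > 	/* pass 2: the rest of the llc, with the cluster CPUs masked out */
> > 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > 	cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
> > 	return scan_for_idle(cpus);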
>
> Could you split this patch in 2 patches ? One for adding a cluster
> sched domain level and one for modifying the wake up path ?
Yes. If this is helpful, I would like to split it into two patches.
>
> This would ease the review and I would be curious about the impact of
> each feature in the performance. In particular, I'm still not
> convinced that the modification of the wakeup path is the root of the
> hackbench improvement; especially with g=14 where there should not be
> much idle CPUs with 14*40 tasks on at most 32 CPUs. IIRC, there was
My understanding is that threads can still be blocked on the pipe, so
CPUs still have some chance to be idle even with a big g. Also note that
the default g of hackbench is 10.
Anyway, I'd like to add some tracepoints to measure what percentage of
wakeups pick a CPU from within the cluster and what percentage select a
CPU from outside it.
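Something like the below fragment is what I have in mind; only a rough
sketch using trace_printk() at the end of select_idle_cpu(), not a real
tracepoint:

	/* sketch: classify where the picked CPU came from */
	if ((unsigned int)idle_cpu < nr_cpumask_bits)
		trace_printk("sis: picked cpu%d %s the target cluster (target=%d)\n",
			     idle_cpu,
			     cpumask_test_cpu(idle_cpu, cpu_cluster_mask(target)) ?
			     "inside" : "outside", target);

The percentages could then be computed from the entries in
/sys/kernel/debug/tracing/trace.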
> no obvious improvement with the changes in select_idle_cpu unless you
> hack the behavior to not fall back to llc domain
>
You have a good memory. In a very old version I did mention that. But
at that time I didn't decrease nr after scanning the cluster, so it was
scanning at least 8 CPUs (4 within the cluster, 4 outside it). I guess
that is why only my hack of not falling back to the llc domain brought
an actual hackbench improvement.
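To make the nr accounting concrete, v4 shares one SIS_PROP budget across
both passes, roughly like this (simplified from the patch below, with the
SMT case and bookkeeping dropped):

	int nr = INT_MAX;	/* lowered by SIS_PROP when !smt */

	/* pass 1 over the cluster decrements nr for every CPU probed... */
	i = _select_idle_cpu(smt, p, target, cpus, &idle_cpu, &nr);
	if (nr <= 0)
		return -1;	/* budget already spent inside the cluster */

	/* ...so pass 2 over the rest of the llc only gets what is left */
	i = _select_idle_cpu(smt, p, target, cpus, &idle_cpu, &nr);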
> > We run the below hackbench with the -g parameter varying from 2 to 14;
> > for each g, we run the command 10 times and take the average time (a
> > sketch of the averaging loop follows the example below):
> > $ numactl -N 0 hackbench -p -T -l 20000 -g $1
> >
> > hackbench reports the time needed to complete a certain number of
> > message transmissions between a certain number of tasks, for example:
> > $ numactl -N 0 hackbench -p -T -l 20000 -g 10
> > Running in threaded mode with 10 groups using 40 file descriptors each
> > (== 400 tasks)
> > Each sender will pass 20000 messages of 100 bytes
> > Time: 8.874
> >
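> > The 10-run average for each g can be gathered with a small wrapper
> > script along these lines ($1 is the group count, as above; the awk
> > averaging is a sketch, not necessarily the exact script used):
> >
> > for i in $(seq 10); do
> >         numactl -N 0 hackbench -p -T -l 20000 -g $1
> > done | awk '/Time:/ { sum += $2; n++ } END { printf "%.4f\n", sum/n }'
> >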
> > The below is the result of hackbench w/ and w/o the patch:
> > g=        2       4       6       8      10      12      14
> > w/o: 1.9596  4.0506  5.9654  8.0068  9.8147 11.4900 13.1163
> > w/ : 1.9362  3.9197  5.6570  7.1376  8.5263 10.0512 11.3256
> >              +3.3%   +5.2%  +10.9%  +13.2%  +12.8%  +13.7%
> >
> > Signed-off-by: Barry Song <song.bao.hua at hisilicon.com>
> > ---
> > -v4:
> > * rebased to tip/sched/core with the latest unified code of select_idle_cpu
> > * also added benchmark data of spreading unrelated tasks
> > * avoided the iteration of sched_domain by moving to a static_key
> > (addressing Vincent's comment)
> >
> > arch/arm64/Kconfig | 7 +++++
> > include/linux/sched/cluster.h | 19 ++++++++++++
> > include/linux/sched/sd_flags.h | 9 ++++++
> > include/linux/sched/topology.h | 7 +++++
> > include/linux/topology.h | 7 +++++
> > kernel/sched/core.c | 18 ++++++++++++
> > kernel/sched/fair.c | 66 +++++++++++++++++++++++++++++++++---------
> > kernel/sched/sched.h | 1 +
> > kernel/sched/topology.c | 6 ++++
> > 9 files changed, 126 insertions(+), 14 deletions(-)
> > create mode 100644 include/linux/sched/cluster.h
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index f39568b..158b0fa 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -971,6 +971,13 @@ config SCHED_MC
> > making when dealing with multi-core CPU chips at a cost of slightly
> > increased overhead in some places. If unsure say N here.
> >
> > +config SCHED_CLUSTER
> > + bool "Cluster scheduler support"
> > + help
> > + Cluster scheduler support improves the CPU scheduler's decision
> > + making when dealing with machines that have clusters (sharing an
> > + internal bus or LLC cache tags). If unsure say N here.
> > +
> > config SCHED_SMT
> > bool "SMT scheduler support"
> > help
> > diff --git a/include/linux/sched/cluster.h b/include/linux/sched/cluster.h
> > new file mode 100644
> > index 0000000..ea6c475
> > --- /dev/null
> > +++ b/include/linux/sched/cluster.h
> > @@ -0,0 +1,19 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_SCHED_CLUSTER_H
> > +#define _LINUX_SCHED_CLUSTER_H
> > +
> > +#include <linux/static_key.h>
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +extern struct static_key_false sched_cluster_present;
> > +
> > +static __always_inline bool sched_cluster_active(void)
> > +{
> > + return static_branch_likely(&sched_cluster_present);
> > +}
> > +#else
> > +static inline bool sched_cluster_active(void) { return false; }
> > +
> > +#endif
> > +
> > +#endif
> > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > index 34b21e9..fc3c894 100644
> > --- a/include/linux/sched/sd_flags.h
> > +++ b/include/linux/sched/sd_flags.h
> > @@ -100,6 +100,15 @@
> > SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> >
> > /*
> > + * Domain members share CPU cluster resources (i.e. llc cache tags)
> > + *
> > + * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
> > + * the cluster resources (such as llc tags and internal bus)
> > + * NEEDS_GROUPS: Caches are shared between groups.
> > + */
> > +SD_FLAG(SD_SHARE_CLS_RESOURCES, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> > +
> > +/*
> > * Domain members share CPU package resources (i.e. caches)
> > *
> > * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778..846fcac 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
> > }
> > #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +static inline int cpu_cluster_flags(void)
> > +{
> > + return SD_SHARE_CLS_RESOURCES | SD_SHARE_PKG_RESOURCES;
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_SCHED_MC
> > static inline int cpu_core_flags(void)
> > {
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 80d27d7..0b3704a 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
> > }
> > #endif
> >
> > +#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
> > +static inline const struct cpumask *cpu_cluster_mask(int cpu)
> > +{
> > + return topology_cluster_cpumask(cpu);
> > +}
> > +#endif
> > +
> > static inline const struct cpumask *cpu_cpu_mask(int cpu)
> > {
> > return cpumask_of_node(cpu_to_node(cpu));
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 88a2e2b..d805e59 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7797,6 +7797,16 @@ int sched_cpu_activate(unsigned int cpu)
> > if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> > static_branch_inc_cpuslocked(&sched_smt_present);
> > #endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > + /*
> > + * When going up, increment the number of cluster cpus with
> > + * cluster present.
> > + */
> > + if (cpumask_weight(cpu_cluster_mask(cpu)) > 1)
> > + static_branch_inc_cpuslocked(&sched_cluster_present);
> > +#endif
> > +
> > set_cpu_active(cpu, true);
> >
> > if (sched_smp_initialized) {
> > @@ -7873,6 +7883,14 @@ int sched_cpu_deactivate(unsigned int cpu)
> > static_branch_dec_cpuslocked(&sched_smt_present);
> > #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > + /*
> > + * When going down, decrement the number of cpus with cluster present.
> > + */
> > + if (cpumask_weight(cpu_cluster_mask(cpu)) > 1)
> > + static_branch_dec_cpuslocked(&sched_cluster_present);
> > +#endif
> > +
> > if (!sched_smp_initialized)
> > return 0;
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b..3db7b07 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6009,6 +6009,11 @@ static inline int __select_idle_cpu(int cpu)
> > return -1;
> > }
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +DEFINE_STATIC_KEY_FALSE(sched_cluster_present);
> > +EXPORT_SYMBOL_GPL(sched_cluster_present);
> > +#endif
> > +
> > #ifdef CONFIG_SCHED_SMT
> > DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> > EXPORT_SYMBOL_GPL(sched_smt_present);
> > @@ -6116,6 +6121,26 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
> >
> > #endif /* CONFIG_SCHED_SMT */
> >
> > +static inline int _select_idle_cpu(bool smt, struct task_struct *p, int target, struct cpumask *cpus, int *idle_cpu, int *nr)
> > +{
> > + int cpu, i;
> > +
> > + for_each_cpu_wrap(cpu, cpus, target) {
> > + if (smt) {
> > + i = select_idle_core(p, cpu, cpus, idle_cpu);
> > + } else {
> > + if (!--*nr)
> > + return -1;
> > + i = __select_idle_cpu(cpu);
> > + }
> > +
> > + if ((unsigned int)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + return -1;
> > +}
> > +
> > /*
> > * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> > @@ -6124,7 +6149,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
> > static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > - int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > + int i, idle_cpu = -1, nr = INT_MAX;
> > bool smt = test_idle_cores(target, false);
> > int this = smp_processor_id();
> > struct sched_domain *this_sd;
> > @@ -6134,7 +6159,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > if (!this_sd)
> > return -1;
> >
> > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > + if (!sched_cluster_active())
> > + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +#ifdef CONFIG_SCHED_CLUSTER
> > + if (sched_cluster_active())
> > + cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
> > +#endif
> >
> > if (sched_feat(SIS_PROP) && !smt) {
> > u64 avg_cost, avg_idle, span_avg;
> > @@ -6155,24 +6185,32 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > time = cpu_clock(this);
> > }
> >
> > - for_each_cpu_wrap(cpu, cpus, target) {
> > - if (smt) {
> > - i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > - if ((unsigned int)i < nr_cpumask_bits)
> > - return i;
> > + /* scan cluster before scanning the whole llc */
> > +#ifdef CONFIG_SCHED_CLUSTER
> > + if (sched_cluster_active()) {
> > + i = _select_idle_cpu(smt, p, target, cpus, &idle_cpu, &nr);
> > + if ((unsigned int) i < nr_cpumask_bits) {
> > + idle_cpu = i;
> > + goto done;
> > + } else if (nr <= 0)
> > + return -1;
> >
> > - } else {
> > - if (!--nr)
> > - return -1;
> > - idle_cpu = __select_idle_cpu(cpu);
> > - if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > - break;
> > - }
> > + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > + cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
> > }
> > +#endif
> > +
> > + i = _select_idle_cpu(smt, p, target, cpus, &idle_cpu, &nr);
> > + if ((unsigned int) i < nr_cpumask_bits) {
> > + idle_cpu = i;
> > + goto done;
> > + } else if (nr <= 0)
> > + return -1;
> >
> > if (smt)
> > set_idle_cores(this, false);
> >
> > +done:
> > if (sched_feat(SIS_PROP) && !smt) {
> > time = cpu_clock(this) - time;
> > update_avg(&this_sd->avg_scan_cost, time);
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 10a1522..48a020f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -6,6 +6,7 @@
> >
> > #include <linux/sched/autogroup.h>
> > #include <linux/sched/clock.h>
> > +#include <linux/sched/cluster.h>
> > #include <linux/sched/coredump.h>
> > #include <linux/sched/cpufreq.h>
> > #include <linux/sched/cputime.h>
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 09d3504..d019c25 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1361,6 +1361,7 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
> > */
> > #define TOPOLOGY_SD_FLAGS \
> > (SD_SHARE_CPUCAPACITY | \
> > + SD_SHARE_CLS_RESOURCES | \
> > SD_SHARE_PKG_RESOURCES | \
> > SD_NUMA | \
> > SD_ASYM_PACKING)
> > @@ -1480,6 +1481,11 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
> > #ifdef CONFIG_SCHED_SMT
> > { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > #endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > + { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > +#endif
> > +
> > #ifdef CONFIG_SCHED_MC
> > { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> > #endif
> > --
> > 1.8.3.1
> >
> >
Thanks
Barry