[PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path

Mon Aug 22 20:45:34 PDT 2022

On 2022-08-22 at 15:36:10 +0800, Yicong Yang wrote:
> From: Barry Song <song.bao.hua at hisilicon.com>
> 
> For platforms having clusters like Kunpeng920, CPUs within the same cluster
> have lower latency when synchronizing and accessing shared resources like
> cache. Thus, this patch tries to find an idle cpu within the cluster of the
> target CPU before scanning the whole LLC to gain lower latency.
> 
> Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
> to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each
> cluster has 4 CPUs.
> 
> With this patch, we noticed improvement on tbench both within one NUMA
> node and across two NUMA nodes.
> 
> On numa 0:
>                              6.0-rc1                patched
> Hmean     1        351.20 (   0.00%)      396.45 *  12.88%*
> Hmean     2        700.43 (   0.00%)      793.76 *  13.32%*
> Hmean     4       1404.42 (   0.00%)     1583.62 *  12.76%*
> Hmean     8       2833.31 (   0.00%)     3147.85 *  11.10%*
> Hmean     16      5501.90 (   0.00%)     6089.89 *  10.69%*
> Hmean     32     10428.59 (   0.00%)    10619.63 *   1.83%*
> Hmean     64      8223.39 (   0.00%)     8306.93 *   1.02%*
> Hmean     128     7042.88 (   0.00%)     7068.03 *   0.36%*
> 
> On numa 0-1:
>                              6.0-rc1                patched
> Hmean     1        363.06 (   0.00%)      397.13 *   9.38%*
> Hmean     2        721.68 (   0.00%)      789.84 *   9.44%*
> Hmean     4       1435.15 (   0.00%)     1566.01 *   9.12%*
> Hmean     8       2776.17 (   0.00%)     3007.05 *   8.32%*
> Hmean     16      5471.71 (   0.00%)     6103.91 *  11.55%*
> Hmean     32     10164.98 (   0.00%)    11531.81 *  13.45%*
> Hmean     64     17143.28 (   0.00%)    20078.68 *  17.12%*
> Hmean     128    14552.70 (   0.00%)    15156.41 *   4.15%*
> Hmean     256    12827.37 (   0.00%)    13326.86 *   3.89%*
> 
> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
> in the code has not been tested, but it is supposed to work.
> 
> Suggested-by: Peter Zijlstra <peterz at infradead.org>
> [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
> Tested-by: Yicong Yang <yangyicong at hisilicon.com>
> Signed-off-by: Barry Song <song.bao.hua at hisilicon.com>
> Signed-off-by: Yicong Yang <yangyicong at hisilicon.com>
> Reviewed-by: Tim Chen <tim.c.chen at linux.intel.com>
> ---
>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>  kernel/sched/sched.h    |  2 ++
>  kernel/sched/topology.c | 10 ++++++++++
>  3 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 914096c5b1ae..6fa77610d0f5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6437,6 +6437,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		}
>  	}
>  
> +	if (static_branch_unlikely(&sched_cluster_active)) {
> +		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
> +
> +		if (sdc) {
> +			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
Looks good to me. One minor question: why don't we use
	cpumask_and(cpus, sched_domain_span(sdc), cpus);
> +				if (!cpumask_test_cpu(cpu, cpus))
> +					continue;
so that the check above can be removed from each loop iteration? Also, may I
know which version this patch is based on? I failed to apply it on
v6.0-rc2. Other than that:

Reviewed-by: Chen Yu <yu.c.chen at intel.com>

thanks,
Chenyu
> +
> +				if (has_idle_core) {
> +					i = select_idle_core(p, cpu, cpus, &idle_cpu);
> +					if ((unsigned int)i < nr_cpumask_bits)
> +						return i;
> +				} else {
> +					if (--nr <= 0)
> +						return -1;
> +					idle_cpu = __select_idle_cpu(cpu, p);
> +					if ((unsigned int)idle_cpu < nr_cpumask_bits)
> +						return idle_cpu;
> +				}
> +			}
> +			cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
> +		}
> +	}
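
To make the cpumask_and() question above concrete, here is a minimal userspace
sketch of the "cluster first, then the rest of the LLC" scan with the
intersection computed once up front. It is a toy model, not kernel code: a
uint64_t stands in for struct cpumask, and mask_t, scan_wrap() and
pick_idle_cpu() are hypothetical names made up for this illustration. The
sketch keeps the intersection in a separate variable rather than overwriting
the allowed mask, because the quoted hunk still needs cpus afterwards for
cpumask_andnot().

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64              /* toy limit: one bit per CPU in a uint64_t */

typedef uint64_t mask_t;        /* hypothetical stand-in for struct cpumask */

/*
 * Return the first CPU that is set in both 'candidates' and 'idle',
 * starting at 'start' and wrapping around, or -1 if there is none.
 */
static int scan_wrap(mask_t candidates, mask_t idle, int start)
{
	for (int n = 0; n < NR_CPUS; n++) {
		int cpu = (start + n) % NR_CPUS;
		mask_t bit = (mask_t)1 << cpu;

		if ((candidates & bit) && (idle & bit))
			return cpu;
	}
	return -1;
}

/*
 * Scan the target's cluster first, then the remaining allowed CPUs.
 * 'allowed' plays the role of cpus, 'cluster' the role of
 * sched_domain_span(sdc) in the quoted hunk.
 */
static int pick_idle_cpu(mask_t allowed, mask_t cluster, mask_t idle, int target)
{
	mask_t in_cluster;
	int cpu;

	assert(target >= 0 && target < NR_CPUS);

	/* Intersect once, so the loop needs no per-CPU membership test. */
	in_cluster = allowed & cluster;
	cpu = scan_wrap(in_cluster, idle, target + 1);
	if (cpu >= 0)
		return cpu;

	/* Nothing idle in the cluster: fall back to the rest of the set. */
	return scan_wrap(allowed & ~cluster, idle, target + 1);
}
```

For example, with allowed = 0xFF, cluster = 0x0F, target = 1 and CPUs 3 and 5
idle, pick_idle_cpu() returns CPU 3 from the cluster; if only CPU 5 is idle it
falls through to the second scan and returns CPU 5.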