[RFC PATCH v2 2/2] scheduler: add scheduler level for clusters

Wed Dec 9 06:35:30 EST 2020

> -----Original Message-----
> From: Vincent Guittot [mailto:vincent.guittot at linaro.org]
> Sent: Tuesday, December 8, 2020 4:29 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua at hisilicon.com>
> Cc: Valentin Schneider <valentin.schneider at arm.com>; Catalin Marinas
> <catalin.marinas at arm.com>; Will Deacon <will at kernel.org>; Rafael J. Wysocki
> <rjw at rjwysocki.net>; Cc: Len Brown <lenb at kernel.org>;
> gregkh at linuxfoundation.org; Jonathan Cameron <jonathan.cameron at huawei.com>;
> Ingo Molnar <mingo at redhat.com>; Peter Zijlstra <peterz at infradead.org>; Juri
> Lelli <juri.lelli at redhat.com>; Dietmar Eggemann <dietmar.eggemann at arm.com>;
> Steven Rostedt <rostedt at goodmis.org>; Ben Segall <bsegall at google.com>; Mel
> Gorman <mgorman at suse.de>; Mark Rutland <mark.rutland at arm.com>; LAK
> <linux-arm-kernel at lists.infradead.org>; linux-kernel
> <linux-kernel at vger.kernel.org>; ACPI Devel Maling List
> <linux-acpi at vger.kernel.org>; Linuxarm <linuxarm at huawei.com>; xuwei (O)
> <xuwei5 at huawei.com>; Zengtao (B) <prime.zeng at hisilicon.com>
> Subject: Re: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters
> 
> On Mon, 7 Dec 2020 at 10:59, Song Bao Hua (Barry Song)
> <song.bao.hua at hisilicon.com> wrote:
> >
> >
> >
> > >
> > > On Thu, 3 Dec 2020 at 10:11, Song Bao Hua (Barry Song)
> > > <song.bao.hua at hisilicon.com> wrote:
> > > >
> > > >
> > > >
> > > > >
> > > > > On Wed, 2 Dec 2020 at 21:58, Song Bao Hua (Barry Song)
> > > > > <song.bao.hua at hisilicon.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > > Sorry, please ignore this. I added some printk calls here while
> > > > > > > > testing one NUMA node. I will send the updated data in another email.
> > > > > >
> > > > > > Re-tested within one NUMA node (cpu0-cpu23):
> > > > > >
> > > > > > g=1
> > > > > > Running in threaded mode with 1 groups using 40 file descriptors
> > > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > > w/o: 7.689 7.485 7.485 7.458 7.524 7.539 7.738 7.693 7.568 7.674=7.5853
> > > > > > w/ : 7.516 7.941 7.374 7.963 7.881 7.910 7.420 7.556 7.695 7.441=7.6697
> > > > > > w/ but dropped select_idle_cluster:
> > > > > >      7.752 7.739 7.739 7.571 7.545 7.685 7.407 7.580 7.605 7.487=7.611
> > > > > >
> > > > > > g=2
> > > > > > Running in threaded mode with 2 groups using 40 file descriptors
> > > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > > > w/o: 10.127 10.119 10.070 10.196 10.057 10.111 10.045 10.164 10.162 9.955=10.1006
> > > > > > w/ : 9.694 9.654 9.612 9.649 9.686 9.734 9.607 9.842 9.690 9.710=9.6878
> > > > > > w/ but dropped select_idle_cluster:
> > > > > >      9.877 10.069 9.951 9.918 9.947 9.790 9.906 9.820 9.863 9.906=9.9047
> > > > > >
> > > > > > g=3
> > > > > > Running in threaded mode with 3 groups using 40 file descriptors
> > > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > > > w/o: 15.885 15.254 15.932 15.647 16.120 15.878 15.857 15.759 15.674 15.721=15.7727
> > > > > > > w/ : 14.974 14.657 13.969 14.985 14.728 15.665 15.191 14.995 14.946 14.895=14.9005
> > > > > > > w/ but dropped select_idle_cluster:
> > > > > > >      15.405 15.177 15.373 15.187 15.450 15.540 15.278 15.628 15.228 15.325=15.3591
> > > > > >
> > > > > > g=4
> > > > > > Running in threaded mode with 4 groups using 40 file descriptors
> > > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > > > w/o: 20.014 21.025 21.119 21.235 19.767 20.971 20.962 20.914 21.090 21.090=20.8187
> > > > > > > w/ : 20.331 20.608 20.338 20.445 20.456 20.146 20.693 20.797 21.381 20.452=20.5647
> > > > > > > w/ but dropped select_idle_cluster:
> > > > > > >      19.814 20.126 20.229 20.350 20.750 20.404 19.957 19.888 20.226 20.562=20.2306
> > > > > >
> > > > >
> > > > > I assume that you ran this on v5.9, as in the previous tests.
> > > >
> > > > Yep
> > > >
> > > > > The results don't show any real benefit of select_idle_cluster()
> > > > > inside a node whereas this is where we could expect most of the
> > > > > benefit. We have to understand why we have such an impact on numa
> > > > > tests only.
> > > >
> > > > There is a 4-5.5% improvement for g=2 and g=3.
> > >
> > > My point was about with vs. without select_idle_cluster() while still
> > > having a cluster domain level.
> > > In this case, the diff is -0.8% for g=1, +2.2% for g=2, +3% for g=3 and
> > > -1.7% for g=4.
> > >
> > > >
> > > > Regarding the huge increase in the NUMA case: at first I suspected
> > > > that we had the wrong llc domain. For example, if cpu0's llc domain
> > > > spanned cpu0-cpu47, select_idle_cpu() would be scanning the wrong range
> > > > when it should scan cpu0-cpu23.
> > > >
> > > > But after printing the llc domain's span, I found that it is correct.
> > > > Cpu0's llc span: cpu0-cpu23
> > > > Cpu24's llc span: cpu24-cpu47
> > >
> > > Have you checked that the cluster mask was also correct ?
> > >
> > > >
> > > > Maybe I need more trace data to figure out whether select_idle_cpu() is
> > > > running correctly. For example, maybe I can figure out whether it always
> > > > returns -1, or returns -1 very often?
> > >
> > > Yes, it could be interesting to check how often select_idle_cpu() returns -1.
> > >
> > > >
> > > > Or do you have any idea?
> > >
> > > Tracking migration across nodes could help to understand this too.
> >
> > I had set mem=4G in bootargs for a swapping test before working on the
> > cluster scheduler issue, but I forgot to remove the parameter.
> >
> > The huge increase in the cross-NUMA case can only be reproduced when
> > I use this mem=4G cmdline, which means numa1 has no memory.
> > After removing the limitation, I can no longer reproduce the huge
> > increase across two NUMA nodes.
> 
> OK, that makes more sense.

I managed to use linux-next to test after fixing the disk hang.

But I am still struggling with how to leverage the cluster
topology in select_idle_cpu() to get a large improvement in the benchmark.

If I take the scheduler's placement out of the picture with taskset,
there is clearly a large difference between running hackbench inside
one cluster and across clusters:

Inside a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285

Across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524

But no matter how I tune the code in kernel/sched/fair.c, I
don't see a difference this large when running hackbench across the
whole NUMA node:
for i in {1..10}
do
	numactl -N 0 hackbench -p -T -l 20000 -g $1
done

Usually the difference stays within -5% to +5%.
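
For reference, the per-configuration figures in this thread are means of the "Time:" value over ten runs. A minimal sketch of that bookkeeping follows. The helper name mean_time is made up for illustration, and the only assumption about hackbench is that it prints a line of the form "Time: <seconds>", as in the output shown earlier.

```shell
#!/bin/sh
# mean_time: read hackbench output on stdin and print the mean of all
# "Time: <seconds>" lines, to four decimal places. Other lines are ignored.
# Exits non-zero if no "Time:" line was seen.
# (Hypothetical helper name; not part of hackbench or of the patch.)
mean_time() {
	awk '$1 == "Time:" { sum += $2; n++ }
	     END { if (n > 0) printf "%.4f\n", sum / n; else exit 1 }'
}

# Intended use on the test machine (needs hackbench and numactl):
#   for i in 1 2 3 4 5 6 7 8 9 10; do
#       numactl -N 0 hackbench -p -T -l 20000 -g 1
#   done | mean_time
```

Piping the whole loop into one awk process keeps the averaging out of the timed region, so it should not perturb the benchmark itself.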

Then I made a major change as below:
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
	...

	time = cpu_clock(this);

#if 0
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}
#else
	/* Scan the cluster only; no fallback to the LLC span. */
	if ((cpu = select_idle_cluster(p, target)) == -1)
		return -1;
#endif

	time = cpu_clock(this) - time;
	update_avg(&this_sd->avg_scan_cost, time);

	return cpu;
}

That means I no longer fall back to the llc if the cluster has no
idle cpu.

With this change I finally see the large difference I had been expecting,
roughly 20% at the higher group counts:

g=          1        2        3        4        5        6        7        8        9       10
w/o    1.5494   2.0641   3.1640   4.2438   5.3445   6.3098   7.5086   8.4721   9.7115  10.8588
w/     1.6801   2.0280   2.7890   3.7339   4.5748   5.2998   6.1413   6.6206   7.7641   8.4782
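
One way to read the table is as a per-group relative change. The sketch below recomputes it from the numbers above; the helper name pct_faster is hypothetical, and the sign convention (positive means the modified kernel took less time) is an assumption made here for readability.

```shell
#!/bin/sh
# pct_faster BASE NEW: print how much less time NEW took than BASE,
# as a percentage of BASE (negative means NEW was slower).
# (Hypothetical helper name, for illustration only.)
pct_faster() {
	awk -v base="$1" -v new="$2" \
	    'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# Times copied from the table above, g=1..10.
wo="1.5494 2.0641 3.1640 4.2438 5.3445 6.3098 7.5086 8.4721 9.7115 10.8588"
w="1.6801 2.0280 2.7890 3.7339 4.5748 5.2998 6.1413 6.6206 7.7641 8.4782"

g=1
for base in $wo; do
	# Pick the g-th entry of the "w/" row to pair with this "w/o" entry.
	new=$(echo "$w" | cut -d' ' -f"$g")
	printf 'g=%-2d %6s%%\n' "$g" "$(pct_faster "$base" "$new")"
	g=$((g + 1))
done
```

On these numbers the gain grows with load: g=1 is about 8.4% slower with the change, while g=10 is about 21.9% faster.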

I guess my original patch falls back to the llc too easily, because a
cluster rarely has an idle cpu. Once the system is busy, the original
patch is effectively a no-op, as it always ends up falling back to the llc.

> 
> >
> > I guess select_idle_cluster() somehow works around a scheduler issue
> > for a NUMA node without memory.
> >
> > >
> > > Vincent

Thanks
Barry