[PATCH v9 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path

Mon Aug 7 06:15:13 PDT 2023

On 2023/8/5 0:29, Chen Yu wrote:
> Hi Yicong,
> 
> On 2023-08-01 at 20:06:56 +0800, Yicong Yang wrote:
>> Hi Chenyu,
>>
>> Sorry for the late reply. Something's wrong and cause this didn't appear
>> in my mail box. I check it out on the LKML.
>>
>  
> No worries : )
> 
>> On 2023/7/21 17:52, Chen Yu wrote:
>>> Hi Yicong,
>>>
>>> Thanks for sending this version!
>>>
>>> On 2023-07-19 at 17:28:38 +0800, Yicong Yang wrote:
>>>> From: Barry Song <song.bao.hua at hisilicon.com>
>>>>
>>>> For platforms having clusters like Kunpeng920, CPUs within the same cluster
>>>> have lower latency when synchronizing and accessing shared resources like
>>>> cache. Thus, this patch tries to find an idle cpu within the cluster of the
>>>> target CPU before scanning the whole LLC to gain lower latency. This
>>>> will be implemented in 3 steps in select_idle_sibling():
>>>> 1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them
>>>>    if they're sharing cluster with the target CPU. Otherwise record them
>>>>    and do the scanning first.
>>>> 2. Scanning the cluster prior to the LLC of the target CPU for an
>>>>    idle CPU to wakeup.
>>>> 3. If no idle CPU found after scanning and the prev_cpu/recent_used_cpu
>>>>    can be used, use them.
>>>>
>>>> Testing has been done on Kunpeng920 by pinning tasks to one numa and two
>>>> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs.
>>>>
>>>> With this patch, We noticed enhancement on tbench and netperf within one
>>>> numa or cross two numa on 6.5-rc1:
>>>> tbench results (node 0):
>>>>              baseline                    patched
>>>>   1:        325.9673        378.9117 (   16.24%)
>>>>   4:       1311.9667       1501.5033 (   14.45%)
>>>>   8:       2629.4667       2961.9100 (   12.64%)
>>>>  16:       5259.1633       5928.0833 (   12.72%)
>>>>  32:      10368.6333      10566.8667 (    1.91%)
>>>>  64:       7868.7700       8182.0100 (    3.98%)
>>>> 128:       6528.5733       6801.8000 (    4.19%)
>>>> tbench results (node 0-1):
>>>>               vanilla                    patched
>>>>   1:        329.2757        380.8907 (   15.68%)
>>>>   4:       1327.7900       1494.5300 (   12.56%)
>>>>   8:       2627.2133       2917.1233 (   11.03%)
>>>>  16:       5201.3367       5835.9233 (   12.20%)
>>>>  32:       8811.8500      11154.2000 (   26.58%)
>>>>  64:      15832.4000      19643.7667 (   24.07%)
>>>> 128:      12605.5667      14639.5667 (   16.14%)
>>>> netperf results TCP_RR (node 0):
>>>>              baseline                    patched
>>>>   1:      77302.8667      92172.2100 (   19.24%)
>>>>   4:      78724.9200      91581.3100 (   16.33%)
>>>>   8:      79168.1296      91091.7942 (   15.06%)
>>>>  16:      81079.4200      90546.5225 (   11.68%)
>>>>  32:      82201.5799      78910.4982 (   -4.00%)
>>>>  64:      29539.3509      29131.4698 (   -1.38%)
>>>> 128:      12082.7522      11956.7705 (   -1.04%)
>>>> netperf results TCP_RR (node 0-1):
>>>>              baseline                    patched
>>>>   1:      78340.5233      92101.8733 (   17.57%)
>>>>   4:      79644.2483      91326.7517 (   14.67%)
>>>>   8:      79557.4313      90737.8096 (   14.05%)
>>>>  16:      79215.5304      90568.4542 (   14.33%)
>>>>  32:      78999.3983      85460.6044 (    8.18%)
>>>>  64:      74198.9494      74325.4361 (    0.17%)
>>>> 128:      27397.4810      27757.5471 (    1.31%)
>>>> netperf results UDP_RR (node 0):
>>>>              baseline                    patched
>>>>   1:      95721.9367     111546.1367 (   16.53%)
>>>>   4:      96384.2250     110036.1408 (   14.16%)
>>>>   8:      97460.6546     109968.0883 (   12.83%)
>>>>  16:      98876.1687     109387.8065 (   10.63%)
>>>>  32:     104364.6417     105241.6767 (    0.84%)
>>>>  64:      37502.6246      37451.1204 (   -0.14%)
>>>> 128:      14496.1780      14610.5538 (    0.79%)
>>>> netperf results UDP_RR (node 0-1):
>>>>              baseline                    patched
>>>>   1:      96176.1633     111397.5333 (   15.83%)
>>>>   4:      94758.5575     105681.7833 (   11.53%)
>>>>   8:      94340.2200     104138.3613 (   10.39%)
>>>>  16:      95208.5285     106714.0396 (   12.08%)
>>>>  32:      74745.9028     100713.8764 (   34.74%)
>>>>  64:      59351.4977      73536.1434 (   23.90%)
>>>> 128:      23755.4971      26648.7413 (   12.18%)
>>>>
>>>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
>>>> in the code has not been tested but it supposed to work.
>>>>
>>>> Chen Yu also noticed this will improve the performance of tbench and
>>>> netperf on a 24 CPUs Jacobsville machine, there are 4 CPUs in one
>>>> cluster sharing L2 Cache.
>>>>
>>>> Suggested-by: Peter Zijlstra <peterz at infradead.org>
>>>> [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
>>>> Tested-by: Yicong Yang <yangyicong at hisilicon.com>
>>>> Signed-off-by: Barry Song <song.bao.hua at hisilicon.com>
>>>> Signed-off-by: Yicong Yang <yangyicong at hisilicon.com>
>>>> Reviewed-by: Tim Chen <tim.c.chen at linux.intel.com>
>>>> Reviewed-by: Chen Yu <yu.c.chen at intel.com>
>>>> ---
>>>>  kernel/sched/fair.c     | 59 +++++++++++++++++++++++++++++++++++++----
>>>>  kernel/sched/sched.h    |  1 +
>>>>  kernel/sched/topology.c | 12 +++++++++
>>>>  3 files changed, 67 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b3e25be58e2b..d91bf64f81f5 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -7012,6 +7012,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>  		}
>>>>  	}
>>>>  
>>>> +	if (static_branch_unlikely(&sched_cluster_active)) {
>>>> +		struct sched_group *sg = sd->groups;
>>>> +
>>>> +		if (sg->flags & SD_CLUSTER) {
>>>> +			for_each_cpu_wrap(cpu, sched_group_span(sg), target + 1) {
>>>> +				if (!cpumask_test_cpu(cpu, cpus))
>>>> +					continue;
>>>> +
>>>> +				if (has_idle_core) {
>>>> +					i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>>> +					if ((unsigned int)i < nr_cpumask_bits)
>>>> +						return i;
>>>> +				} else {
>>>> +					if (--nr <= 0)
>>>> +						return -1;
>>>> +					idle_cpu = __select_idle_cpu(cpu, p);
>>>> +					if ((unsigned int)idle_cpu < nr_cpumask_bits)
>>>> +						return idle_cpu;
>>>> +				}
>>>> +			}
>>>> +			cpumask_andnot(cpus, cpus, sched_group_span(sg));
>>>> +		}
>>>> +	}
>>>> +
>>>>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>>>  		if (has_idle_core) {
>>>>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>>> @@ -7019,7 +7043,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>  				return i;
>>>>  
>>>>  		} else {
>>>> -			if (!--nr)
>>>> +			if (--nr <= 0)
>>>>  				return -1;
>>>>  			idle_cpu = __select_idle_cpu(cpu, p);
>>>>  			if ((unsigned int)idle_cpu < nr_cpumask_bits)
>>>> @@ -7121,7 +7145,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>>>  	bool has_idle_core = false;
>>>>  	struct sched_domain *sd;
>>>>  	unsigned long task_util, util_min, util_max;
>>>> -	int i, recent_used_cpu;
>>>> +	int i, recent_used_cpu, prev_aff = -1;
>>>>  
>>>>  	/*
>>>>  	 * On asymmetric system, update task utilization because we will check
>>>> @@ -7148,8 +7172,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>>>  	 */
>>>>  	if (prev != target && cpus_share_cache(prev, target) &&
>>>>  	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>>>> -	    asym_fits_cpu(task_util, util_min, util_max, prev))
>>>> -		return prev;
>>>> +	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
>>>> +		if (!static_branch_unlikely(&sched_cluster_active))
>>>> +			return prev;
>>>> +
>>>> +		if (cpus_share_resources(prev, target))
>>>> +			return prev;
>>>
>>> I have one minor question, previously Peter mentioned that he wants to get rid of the
>>> percpu sd_share_id, not sure if he means that not using it in select_idle_cpu()
>>> or remove that variable completely to not introduce extra space? 
>>> Hi Peter, could you please give us more hints on this? thanks.
>>>
>>> If we wants to get rid of this variable, would this work?
>>>
>>> 	if ((sd->groups->flags & SD_CLUSTER) &&
>>> 	    cpumask_test_cpu(prev, sched_group_span(sd->groups))
>>> 		return prev
>>>
>>
>> In the current implementation, nop, we haven't deferenced the @sd yet and we don't
>> need to if scanning is not needed.
>>
>> Since we're on the quick path without scanning here, I wonder it'll be a bit more
>> efficient to use a per-cpu id rather than deference the rcu and do the bitmap
>> computation.
>>
> 
> Dereference is a memory barrier and the bitmap is of one operation/instruction which
> should not have too much overhead. But anyway I've tested this patch on Jacobsville
> and the data looks OK to me:
> 
> 
> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	6-threads	 1.00 (  0.84)	 -0.32 (  0.71)
> TCP_RR          	12-threads	 1.00 (  0.35)	 +1.52 (  0.42)
> TCP_RR          	18-threads	 1.00 (  0.31)	 +3.89 (  0.38)
> TCP_RR          	24-threads	 1.00 (  0.87)	 -0.34 (  0.75)
> TCP_RR          	30-threads	 1.00 (  5.84)	 +0.71 (  4.85)
> TCP_RR          	36-threads	 1.00 (  4.84)	 +0.24 (  3.30)
> TCP_RR          	42-threads	 1.00 (  3.75)	 +0.26 (  3.56)
> TCP_RR          	48-threads	 1.00 (  1.51)	 +0.45 (  1.28)
> UDP_RR          	6-threads	 1.00 (  0.65)	+10.12 (  0.63)
> UDP_RR          	12-threads	 1.00 (  0.20)	 +9.91 (  0.25)
> UDP_RR          	18-threads	 1.00 ( 11.13)	+16.77 (  0.49)
> UDP_RR          	24-threads	 1.00 ( 12.38)	 +2.52 (  0.98)
> UDP_RR          	30-threads	 1.00 (  5.63)	 -0.34 (  4.38)
> UDP_RR          	36-threads	 1.00 ( 19.12)	 -0.89 (  3.30)
> UDP_RR          	42-threads	 1.00 (  2.96)	 -1.41 (  3.17)
> UDP_RR          	48-threads	 1.00 ( 14.08)	 -0.77 ( 10.77)
> 
> Good improvement in several cases. No regression is detected.
> 
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	6-threads	 1.00 (  0.41)	 +1.63 (  0.17)
> loopback        	12-threads	 1.00 (  0.18)	 +4.39 (  0.12)
> loopback        	18-threads	 1.00 (  0.43)	+10.42 (  0.18)
> loopback        	24-threads	 1.00 (  0.38)	 +1.24 (  0.38)
> loopback        	30-threads	 1.00 (  0.24)	 +0.60 (  0.14)
> loopback        	36-threads	 1.00 (  0.17)	 +0.63 (  0.17)
> loopback        	42-threads	 1.00 (  0.26)	 +0.76 (  0.08)
> loopback        	48-threads	 1.00 (  0.23)	 +0.91 (  0.10)
> 
> Good improvement in 18-threads case. No regression is detected.
> 
> hackbench
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1-groups	 1.00 (  0.52)	 +9.26 (  0.57)
> process-pipe    	2-groups	 1.00 (  1.55)	 +6.92 (  0.56)
> process-pipe    	4-groups	 1.00 (  1.36)	 +4.80 (  3.78)
> process-sockets 	1-groups	 1.00 (  2.16)	 -6.35 (  1.10)
> process-sockets 	2-groups	 1.00 (  2.34)	 -6.35 (  5.52)
> process-sockets 	4-groups	 1.00 (  0.35)	 -5.64 (  1.19)
> threads-pipe    	1-groups	 1.00 (  0.82)	 +8.00 (  0.00)
> threads-pipe    	2-groups	 1.00 (  0.47)	 +6.91 (  0.50)
> threads-pipe    	4-groups	 1.00 (  0.45)	 +8.92 (  2.27)
> threads-sockets 	1-groups	 1.00 (  1.02)	 -4.13 (  2.30)
> threads-sockets 	2-groups	 1.00 (  0.34)	 -1.86 (  2.39)
> threads-sockets 	4-groups	 1.00 (  1.51)	 -3.99 (  1.59)
> 
> Pros and cons for hackbench. There is improvement for pipe mode, but
> slight regression on sockets mode. I think this is within acceptable range.
> 
> schbench
> ========
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
> normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
> normal          	4-mthreads	 1.00 (  3.82)	 +0.00 (  3.82)
> 
> There is impact to schbench at all, and the results are quite stable.
> 
> For the whole series:
> 
> Tested-by: Chen Yu <yu.c.chen at intel.com>
> 

Thanks for testing.

Yicong.