[PATCH v9 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path

Yicong Yang yangyicong at huawei.com
Tue Aug 1 05:06:56 PDT 2023


Hi Chenyu,

Sorry for the late reply. Something went wrong and this mail didn't appear
in my mailbox, so I checked it out on LKML.

On 2023/7/21 17:52, Chen Yu wrote:
> Hi Yicong,
> 
> Thanks for sending this version!
> 
> On 2023-07-19 at 17:28:38 +0800, Yicong Yang wrote:
>> From: Barry Song <song.bao.hua at hisilicon.com>
>>
>> For platforms having clusters like Kunpeng920, CPUs within the same cluster
>> have lower latency when synchronizing and accessing shared resources like
>> cache. Thus, this patch tries to find an idle cpu within the cluster of the
>> target CPU before scanning the whole LLC to gain lower latency. This
>> is implemented in 3 steps in select_idle_sibling() (see the sketch
>> after this list):
>> 1. If the prev_cpu/recent_used_cpu are good wakeup candidates, use them
>>    directly when they share a cluster with the target CPU. Otherwise
>>    record them and do the scanning first.
>> 2. Scan the cluster of the target CPU for an idle CPU prior to
>>    scanning the whole LLC.
>> 3. If no idle CPU is found after the scan and the recorded
>>    prev_cpu/recent_used_cpu are usable, use them.
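>>
>> Roughly, the resulting select_idle_sibling() flow looks like the
>> following (simplified sketch, not the exact code; is_idle() stands in
>> for the available_idle_cpu()/sched_idle_cpu() checks):
>>
>>	/* step 1: fast path when prev is idle and in the target's cluster */
>>	if (prev != target && cpus_share_cache(prev, target) && is_idle(prev)) {
>>		if (!sched_cluster_active || cpus_share_resources(prev, target))
>>			return prev;
>>		prev_aff = prev;	/* record it and scan the cluster first */
>>	}
>>	/* recent_used_cpu gets the same treatment */
>>
>>	/* step 2: select_idle_cpu() scans the cluster before the rest of the LLC */
>>	i = select_idle_cpu(p, sd, has_idle_core, target);
>>	if ((unsigned int)i < nr_cpumask_bits)
>>		return i;
>>
>>	/* step 3: nothing idle found, fall back to the recorded candidate */
>>	if (prev_aff != -1)
>>		return prev_aff;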
>>
>> Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
>> across two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and
>> each cluster has 4 CPUs.
>>
>> With this patch, we noticed improvements in tbench and netperf within one
>> NUMA node and across two NUMA nodes on 6.5-rc1:
>> tbench results (node 0):
>>              baseline                    patched
>>   1:        325.9673        378.9117 (   16.24%)
>>   4:       1311.9667       1501.5033 (   14.45%)
>>   8:       2629.4667       2961.9100 (   12.64%)
>>  16:       5259.1633       5928.0833 (   12.72%)
>>  32:      10368.6333      10566.8667 (    1.91%)
>>  64:       7868.7700       8182.0100 (    3.98%)
>> 128:       6528.5733       6801.8000 (    4.19%)
>> tbench results (node 0-1):
>>              baseline                    patched
>>   1:        329.2757        380.8907 (   15.68%)
>>   4:       1327.7900       1494.5300 (   12.56%)
>>   8:       2627.2133       2917.1233 (   11.03%)
>>  16:       5201.3367       5835.9233 (   12.20%)
>>  32:       8811.8500      11154.2000 (   26.58%)
>>  64:      15832.4000      19643.7667 (   24.07%)
>> 128:      12605.5667      14639.5667 (   16.14%)
>> netperf results TCP_RR (node 0):
>>              baseline                    patched
>>   1:      77302.8667      92172.2100 (   19.24%)
>>   4:      78724.9200      91581.3100 (   16.33%)
>>   8:      79168.1296      91091.7942 (   15.06%)
>>  16:      81079.4200      90546.5225 (   11.68%)
>>  32:      82201.5799      78910.4982 (   -4.00%)
>>  64:      29539.3509      29131.4698 (   -1.38%)
>> 128:      12082.7522      11956.7705 (   -1.04%)
>> netperf results TCP_RR (node 0-1):
>>              baseline                    patched
>>   1:      78340.5233      92101.8733 (   17.57%)
>>   4:      79644.2483      91326.7517 (   14.67%)
>>   8:      79557.4313      90737.8096 (   14.05%)
>>  16:      79215.5304      90568.4542 (   14.33%)
>>  32:      78999.3983      85460.6044 (    8.18%)
>>  64:      74198.9494      74325.4361 (    0.17%)
>> 128:      27397.4810      27757.5471 (    1.31%)
>> netperf results UDP_RR (node 0):
>>              baseline                    patched
>>   1:      95721.9367     111546.1367 (   16.53%)
>>   4:      96384.2250     110036.1408 (   14.16%)
>>   8:      97460.6546     109968.0883 (   12.83%)
>>  16:      98876.1687     109387.8065 (   10.63%)
>>  32:     104364.6417     105241.6767 (    0.84%)
>>  64:      37502.6246      37451.1204 (   -0.14%)
>> 128:      14496.1780      14610.5538 (    0.79%)
>> netperf results UDP_RR (node 0-1):
>>              baseline                    patched
>>   1:      96176.1633     111397.5333 (   15.83%)
>>   4:      94758.5575     105681.7833 (   11.53%)
>>   8:      94340.2200     104138.3613 (   10.39%)
>>  16:      95208.5285     106714.0396 (   12.08%)
>>  32:      74745.9028     100713.8764 (   34.74%)
>>  64:      59351.4977      73536.1434 (   23.90%)
>> 128:      23755.4971      26648.7413 (   12.18%)
>>
>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
>> in the code has not been tested, but it is expected to work.
>>
>> Chen Yu also noticed that this improves the performance of tbench and
>> netperf on a 24-CPU Jacobsville machine, where the 4 CPUs in one
>> cluster share the L2 cache.
>>
>> Suggested-by: Peter Zijlstra <peterz at infradead.org>
>> [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
>> Tested-by: Yicong Yang <yangyicong at hisilicon.com>
>> Signed-off-by: Barry Song <song.bao.hua at hisilicon.com>
>> Signed-off-by: Yicong Yang <yangyicong at hisilicon.com>
>> Reviewed-by: Tim Chen <tim.c.chen at linux.intel.com>
>> Reviewed-by: Chen Yu <yu.c.chen at intel.com>
>> ---
>>  kernel/sched/fair.c     | 59 +++++++++++++++++++++++++++++++++++++----
>>  kernel/sched/sched.h    |  1 +
>>  kernel/sched/topology.c | 12 +++++++++
>>  3 files changed, 67 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b3e25be58e2b..d91bf64f81f5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7012,6 +7012,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>  		}
>>  	}
>>  
>> +	if (static_branch_unlikely(&sched_cluster_active)) {
>> +		struct sched_group *sg = sd->groups;
>> +
>> +		if (sg->flags & SD_CLUSTER) {
>> +			for_each_cpu_wrap(cpu, sched_group_span(sg), target + 1) {
>> +				if (!cpumask_test_cpu(cpu, cpus))
>> +					continue;
>> +
>> +				if (has_idle_core) {
>> +					i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> +					if ((unsigned int)i < nr_cpumask_bits)
>> +						return i;
>> +				} else {
>> +					if (--nr <= 0)
>> +						return -1;
>> +					idle_cpu = __select_idle_cpu(cpu, p);
>> +					if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> +						return idle_cpu;
>> +				}
>> +			}
>> +			cpumask_andnot(cpus, cpus, sched_group_span(sg));
>> +		}
>> +	}
>> +
>>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>  		if (has_idle_core) {
>>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> @@ -7019,7 +7043,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>  				return i;
>>  
>>  		} else {
>> -			if (!--nr)
>> +			if (--nr <= 0)
>>  				return -1;
>>  			idle_cpu = __select_idle_cpu(cpu, p);
>>  			if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> @@ -7121,7 +7145,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>  	bool has_idle_core = false;
>>  	struct sched_domain *sd;
>>  	unsigned long task_util, util_min, util_max;
>> -	int i, recent_used_cpu;
>> +	int i, recent_used_cpu, prev_aff = -1;
>>  
>>  	/*
>>  	 * On asymmetric system, update task utilization because we will check
>> @@ -7148,8 +7172,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>  	 */
>>  	if (prev != target && cpus_share_cache(prev, target) &&
>>  	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>> -	    asym_fits_cpu(task_util, util_min, util_max, prev))
>> -		return prev;
>> +	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
>> +		if (!static_branch_unlikely(&sched_cluster_active))
>> +			return prev;
>> +
>> +		if (cpus_share_resources(prev, target))
>> +			return prev;
> 
> I have one minor question: previously Peter mentioned that he wants to get
> rid of the percpu sd_share_id, but I'm not sure whether he means not using
> it in select_idle_cpu(), or removing the variable completely so as not to
> introduce extra space.
> Hi Peter, could you please give us more hints on this? Thanks.
> 
> If we want to get rid of this variable, would this work?
> 
> 	if ((sd->groups->flags & SD_CLUSTER) &&
> 	    cpumask_test_cpu(prev, sched_group_span(sd->groups)))
> 		return prev;
> 

In the current implementation, no: at that point we haven't dereferenced @sd
yet, and we don't need to if no scanning is required.

Since we're on the quick path here, without scanning, I wonder if it'd be a
bit more efficient to use a per-CPU id rather than dereference the
RCU-protected domain pointer and do the cpumask computation.
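
For reference, the per-cpu id check added by this series boils down to
comparing two percpu values. A simplified sketch of cpus_share_resources()
as done in this series:

	/* sd_share_id is set up once when the sched domains are built */
	DECLARE_PER_CPU(int, sd_share_id);

	bool cpus_share_resources(int this_cpu, int that_cpu)
	{
		if (this_cpu == that_cpu)
			return true;

		return per_cpu(sd_share_id, this_cpu) ==
		       per_cpu(sd_share_id, that_cpu);
	}

By contrast, going through sd->groups means an rcu_dereference() of the LLC
domain plus a cpumask_test_cpu() on the group span before we can decide.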

Thanks.


