[PATCH v11 3/3] sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

Fri Oct 20 00:36:28 PDT 2023

On 2023-10-19 at 11:33:23 +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong at hisilicon.com>
> 
> Chen Yu reports a hackbench regression of cluster wakeup when
> hackbench threads equal to the CPU number [1]. Analysis shows
> it's because we wake up more on the target CPU even if the
> prev_cpu is a good wakeup candidate and leads to the decrease
> of the CPU utilization.
> 
> Generally if the task's prev_cpu is idle we'll wake up the task
> on it without scanning. On cluster machines we'll try to wake up
> the task in the same cluster of the target for better cache
> affinity, so if the prev_cpu is idle but not sharing the same
> cluster with the target we'll still try to find an idle CPU within
> the cluster. This will improve the performance at low loads on
> cluster machines. But in the issue above, if the prev_cpu is idle
> but not in the cluster with the target CPU, we'll try to scan an
> idle one in the cluster. But since the system is busy, we're
> likely to fail the scanning and use target instead, even if
> the prev_cpu is idle. Then leads to the regression.
> 
> This patch solves this in 2 steps:
> o record the prev_cpu/recent_used_cpu if they're good wakeup
>   candidates but not sharing the cluster with the target.
> o on scanning failure use the prev_cpu/recent_used_cpu if
>   they're recorded as idle
> 
> [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
> 
> Reported-by: Chen Yu <yu.c.chen at intel.com>
> Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
> Signed-off-by: Yicong Yang <yangyicong at hisilicon.com>
>

Tested on 24 CPUs Jacobsville machine, 4 CPUs in one cluster sharing L2 Cache.
The baseline is sched/core on top of 
commit a36e5741bdc5 ("sched: Fix stop_one_cpu_nowait() vs hotplug"),
and compared with the whole patch set applied. I did not see any regression but
improvement on hackbench, please feel free to add:

Tested-and-reviewed-by: Chen Yu <yu.c.chen at intel.com>

hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  0.26)	 +6.02 (  1.53)
process-pipe    	2-groups	 1.00 (  1.03)	 +1.97 (  0.70)
process-pipe    	4-groups	 1.00 (  0.26)	 +1.80 (  3.27)
process-sockets 	1-groups	 1.00 (  0.29)	 +1.63 (  0.86)
process-sockets 	2-groups	 1.00 (  1.17)	 +2.59 (  0.35)
process-sockets 	4-groups	 1.00 (  1.07)	 +3.86 (  0.51)
threads-pipe    	1-groups	 1.00 (  0.79)	 +8.17 (  0.48)
threads-pipe    	2-groups	 1.00 (  0.65)	 +6.34 (  0.23)
threads-pipe    	4-groups	 1.00 (  0.38)	 +4.61 (  1.04)
threads-sockets 	1-groups	 1.00 (  0.73)	 +0.80 (  0.35)
threads-sockets 	2-groups	 1.00 (  1.09)	 +2.81 (  1.18)
threads-sockets 	4-groups	 1.00 (  0.67)	 +2.30 (  0.20)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	6-threads	 1.00 (  0.48)	 +3.97 (  0.50)
TCP_RR          	12-threads	 1.00 (  0.11)	 +3.83 (  0.15)
TCP_RR          	18-threads	 1.00 (  0.18)	 +7.53 (  0.18)
TCP_RR          	24-threads	 1.00 (  0.34)	 +2.40 (  0.77)
TCP_RR          	30-threads	 1.00 ( 10.39)	 +2.22 ( 11.51)
TCP_RR          	36-threads	 1.00 ( 10.87)	 +2.06 ( 16.71)
TCP_RR          	42-threads	 1.00 ( 14.04)	 +2.10 ( 12.86)
TCP_RR          	48-threads	 1.00 (  5.89)	 +2.15 (  5.54)
UDP_RR          	6-threads	 1.00 (  0.20)	 +2.99 (  0.55)
UDP_RR          	12-threads	 1.00 (  0.18)	 +3.65 (  0.27)
UDP_RR          	18-threads	 1.00 (  0.34)	 +6.62 (  0.23)
UDP_RR          	24-threads	 1.00 (  0.60)	 -1.73 ( 12.54)
UDP_RR          	30-threads	 1.00 (  9.70)	 -0.62 ( 14.34)
UDP_RR          	36-threads	 1.00 ( 11.80)	 -0.05 ( 12.27)
UDP_RR          	42-threads	 1.00 ( 15.35)	 -0.05 ( 12.26)
UDP_RR          	48-threads	 1.00 (  5.58)	 -0.12 (  5.73)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	6-threads	 1.00 (  0.29)	 +2.51 (  0.24)
loopback        	12-threads	 1.00 (  0.08)	 +2.90 (  0.47)
loopback        	18-threads	 1.00 (  0.06)	 +6.85 (  0.07)
loopback        	24-threads	 1.00 (  0.20)	 +1.85 (  0.14)
loopback        	30-threads	 1.00 (  0.15)	 +1.37 (  0.07)
loopback        	36-threads	 1.00 (  0.12)	 +1.34 (  0.07)
loopback        	42-threads	 1.00 (  0.09)	 +0.91 (  0.04)
loopback        	48-threads	 1.00 (  0.11)	 +0.88 (  0.05)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  2.67)	 -1.89 (  0.00)
normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
normal          	4-mthreads	 1.00 (  8.08)	+12.86 (  2.32)

thanks,
Chenyu