[PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
Yicong Yang
yangyicong at huawei.com
Tue Sep 6 01:46:00 PDT 2022
On 2022/9/6 13:28, K Prateek Nayak wrote:
> Hello Yicong,
>
> We've tested the series on a dual socket Zen3 system (2 x 64C/128T).
>
> tl;dr
>
> - The results look good and the changes do not affect the Zen3 machine,
> which doesn't contain any sched domain with the SD_CLUSTER flag set.
>
> - With the latest BIOS, I don't see any regression due to the addition
> of the new per-CPU variables.
> We had previously observed a regression in tbench when testing v4 of
> the series on this system with a slightly outdated BIOS
> (https://lore.kernel.org/lkml/e000b124-afd4-28e1-fde2-393b0e38ce19@amd.com/)
> but that doesn't seem to be the case with the latest BIOS :)
>
> Detailed results from the standard benchmarks are reported below.
>
> On 8/22/2022 1:06 PM, Yicong Yang wrote:
>> From: Yicong Yang <yangyicong at hisilicon.com>
>>
>> This is follow-up work to support the cluster scheduler. Previously
>> we added a cluster level to the scheduler for both ARM64[1] and
>> x86[2] to support load balancing between clusters, bringing more
>> memory bandwidth and reducing cache contention. This patchset, on the
>> other hand, takes care of the wake-up path by giving the CPUs within
>> the same cluster a try before scanning the whole LLC, to benefit tasks
>> that communicate with each other.
>>
>> [1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
>> [2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
>>
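
For illustration only (not the actual kernel change): below is a minimal
userspace sketch of the cluster-first idea described in the cover letter
above, assuming a contiguous CPU numbering where each cluster is a
fixed-size slice of the LLC. All names here (cpu_idle[], scan_for_idle(),
select_idle_cpu_sketch(), CLUSTER_SIZE, LLC_SIZE) are hypothetical
stand-ins, not the series' actual helpers.

#include <stdbool.h>
#include <stdio.h>

#define LLC_SIZE	16	/* CPUs sharing the last-level cache (illustrative) */
#define CLUSTER_SIZE	4	/* CPUs sharing one cluster (illustrative)          */

static bool cpu_idle[LLC_SIZE];	/* stand-in for the scheduler's idle state */

/* Return the first idle CPU in [start, start + nr), or -1 if none. */
static int scan_for_idle(int start, int nr)
{
	for (int cpu = start; cpu < start + nr; cpu++)
		if (cpu_idle[cpu])
			return cpu;
	return -1;
}

/*
 * Cluster-first selection: try the CPUs sharing @target's cluster before
 * falling back to the whole LLC (re-scanning the cluster in the fallback
 * is redundant, but keeps the sketch short).
 */
static int select_idle_cpu_sketch(int target)
{
	int cluster_start = target - (target % CLUSTER_SIZE);
	int cpu = scan_for_idle(cluster_start, CLUSTER_SIZE);

	if (cpu >= 0)
		return cpu;			/* cache-hot CPU close to the waker */
	return scan_for_idle(0, LLC_SIZE);	/* fall back to the rest of the LLC */
}

int main(void)
{
	cpu_idle[2] = true;	/* idle CPU inside target 0's cluster */
	cpu_idle[9] = true;	/* idle CPU elsewhere in the LLC      */

	/* Picks CPU 2 (same cluster) even though CPU 9 is also idle. */
	printf("picked CPU %d\n", select_idle_cpu_sketch(0));
	return 0;
}
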
>
> Discussed below are the results from running standard benchmarks on
> a dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
>
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual-socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over the 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over the 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 5.19.0 tip sched/core
> - cluster: 5.19.0 tip sched/core + both the patches of the series
>
> When we started testing, the tip was at:
> commit: 5531ecffa4b9 "sched: Add update_current_exec_runtime helper"
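
A note on reading the tables below: the values in parentheses appear to be
the percentage change relative to the tip baseline, signed so that a
positive value means the patched kernel did better, i.e.
(tip - cluster) / tip * 100 for time-based benchmarks such as hackbench
and schbench, and (cluster - tip) / tip * 100 for throughput benchmarks
such as tbench and stream. A minimal sketch of that convention, using a
hypothetical helper and two values from the tables below (small deviations
from the reported percentages come from rounding in the printed results):

#include <stdio.h>

/*
 * Illustrative only: percentage change vs. the "tip" baseline, signed so
 * that a positive value means the patched ("cluster") kernel did better.
 */
static double pct_change(double tip, double cluster, int higher_is_better)
{
	double diff = higher_is_better ? cluster - tip : tip - cluster;

	return diff / tip * 100.0;
}

int main(void)
{
	/* hackbench 2-groups, NPS1 (runtime, lower is better): ~1.4 pct */
	printf("%.2f pct\n", pct_change(4.93, 4.86, 0));

	/* tbench 128 clients, NPS1 (throughput, higher is better): ~1.6 pct */
	printf("%.2f pct\n", pct_change(32449.37, 32967.15, 1));
	return 0;
}
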
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> NPS1
>
> Test: tip cluster
> 1-groups: 4.31 (0.00 pct) 4.31 (0.00 pct)
> 2-groups: 4.93 (0.00 pct) 4.86 (1.41 pct)
> 4-groups: 5.38 (0.00 pct) 5.36 (0.37 pct)
> 8-groups: 5.59 (0.00 pct) 5.54 (0.89 pct)
> 16-groups: 7.18 (0.00 pct) 7.47 (-4.03 pct)
>
> NPS2
>
> Test: tip cluster
> 1-groups: 4.25 (0.00 pct) 4.40 (-3.52 pct)
> 2-groups: 4.83 (0.00 pct) 4.73 (2.07 pct)
> 4-groups: 5.25 (0.00 pct) 5.18 (1.33 pct)
> 8-groups: 5.56 (0.00 pct) 5.45 (1.97 pct)
> 16-groups: 6.72 (0.00 pct) 6.63 (1.33 pct)
>
> NPS4
>
> Test: tip cluster
> 1-groups: 4.24 (0.00 pct) 4.23 (0.23 pct)
> 2-groups: 4.88 (0.00 pct) 4.78 (2.04 pct)
> 4-groups: 5.30 (0.00 pct) 5.25 (0.94 pct)
> 8-groups: 5.66 (0.00 pct) 5.61 (0.88 pct)
> 16-groups: 6.79 (0.00 pct) 7.05 (-3.82 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> NPS1
>
> #workers: tip cluster
> 1: 37.00 (0.00 pct) 22.00 (40.54 pct)
> 2: 39.00 (0.00 pct) 23.00 (41.02 pct)
> 4: 41.00 (0.00 pct) 30.00 (26.82 pct)
> 8: 53.00 (0.00 pct) 47.00 (11.32 pct)
> 16: 73.00 (0.00 pct) 73.00 (0.00 pct)
> 32: 116.00 (0.00 pct) 117.00 (-0.86 pct)
> 64: 217.00 (0.00 pct) 221.00 (-1.84 pct)
> 128: 477.00 (0.00 pct) 444.00 (6.91 pct)
> 256: 1062.00 (0.00 pct) 1050.00 (1.12 pct)
> 512: 47552.00 (0.00 pct) 48576.00 (-2.15 pct)
>
> NPS2
>
> #workers: tip cluster
> 1: 20.00 (0.00 pct) 20.00 (0.00 pct)
> 2: 22.00 (0.00 pct) 23.00 (-4.54 pct)
> 4: 30.00 (0.00 pct) 31.00 (-3.33 pct)
> 8: 46.00 (0.00 pct) 49.00 (-6.52 pct)
> 16: 70.00 (0.00 pct) 72.00 (-2.85 pct)
> 32: 120.00 (0.00 pct) 118.00 (1.66 pct)
> 64: 215.00 (0.00 pct) 216.00 (-0.46 pct)
> 128: 482.00 (0.00 pct) 449.00 (6.84 pct)
> 256: 1042.00 (0.00 pct) 995.00 (4.51 pct)
> 512: 47552.00 (0.00 pct) 47296.00 (0.53 pct)
>
> NPS4
>
> #workers: tip cluster
> 1: 18.00 (0.00 pct) 20.00 (-11.11 pct)
> 2: 23.00 (0.00 pct) 22.00 (4.34 pct)
> 4: 27.00 (0.00 pct) 30.00 (-11.11 pct)
> 8: 57.00 (0.00 pct) 60.00 (-5.26 pct)
> 16: 76.00 (0.00 pct) 84.00 (-10.52 pct)
> 32: 120.00 (0.00 pct) 115.00 (4.16 pct)
> 64: 219.00 (0.00 pct) 212.00 (3.19 pct)
> 128: 459.00 (0.00 pct) 442.00 (3.70 pct)
> 256: 1078.00 (0.00 pct) 983.00 (8.81 pct)
> 512: 47040.00 (0.00 pct) 48192.00 (-2.44 pct)
>
> Note: schbench displays a lot of run-to-run variance at low worker
> counts. This behavior is due to the timing of new-idle balance, which
> is not consistent across runs.
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> NPS1
>
> Clients: tip cluster
> 1 573.26 (0.00 pct) 572.61 (-0.11 pct)
> 2 1131.19 (0.00 pct) 1122.41 (-0.77 pct)
> 4 2100.07 (0.00 pct) 2081.74 (-0.87 pct)
> 8 3809.88 (0.00 pct) 3732.14 (-2.04 pct)
> 16 6560.72 (0.00 pct) 6289.22 (-4.13 pct)
> 32 12203.23 (0.00 pct) 11811.74 (-3.20 pct)
> 64 22389.81 (0.00 pct) 21587.79 (-3.58 pct)
> 128 32449.37 (0.00 pct) 32967.15 (1.59 pct)
> 256 58962.40 (0.00 pct) 56604.63 (-3.99 pct)
> 512 59608.71 (0.00 pct) 56529.95 (-5.16 pct) * (Machine Overloaded)
> 512 57925.05 (0.00 pct) 56697.38 (-2.11 pct) [Verification Run]
> 1024 58037.02 (0.00 pct) 55751.53 (-3.93 pct)
>
> NPS2
>
> Clients: tip cluster
> 1 574.20 (0.00 pct) 572.49 (-0.29 pct)
> 2 1131.56 (0.00 pct) 1149.53 (1.58 pct)
> 4 2132.26 (0.00 pct) 2084.18 (-2.25 pct)
> 8 3812.20 (0.00 pct) 3683.04 (-3.38 pct)
> 16 6457.61 (0.00 pct) 6340.70 (-1.81 pct)
> 32 12263.82 (0.00 pct) 11714.15 (-4.48 pct)
> 64 22224.11 (0.00 pct) 21226.34 (-4.48 pct)
> 128 33040.38 (0.00 pct) 32478.99 (-1.69 pct)
> 256 56547.25 (0.00 pct) 52915.71 (-6.42 pct) * (Machine Overloaded)
> 256 55631.80 (0.00 pct) 52905.99 (-4.89 pct) [Verification Run]
> 512 56220.67 (0.00 pct) 54735.69 (-2.64 pct)
> 1024 56048.88 (0.00 pct) 54426.63 (-2.89 pct)
>
> NPS4
>
> Clients: tip cluster
> 1 575.50 (0.00 pct) 570.65 (-0.84 pct)
> 2 1138.70 (0.00 pct) 1137.75 (-0.08 pct)
> 4 2070.66 (0.00 pct) 2103.18 (1.57 pct)
> 8 3811.70 (0.00 pct) 3573.52 (-6.24 pct) *
> 8 3769.53 (0.00 pct) 3653.05 (-3.09 pct) [Verification Run]
> 16 6312.80 (0.00 pct) 6212.41 (-1.59 pct)
> 32 11418.14 (0.00 pct) 11721.01 (2.65 pct)
> 64 19671.16 (0.00 pct) 20053.77 (1.94 pct)
> 128 30258.53 (0.00 pct) 32585.15 (7.68 pct)
> 256 55838.10 (0.00 pct) 51318.64 (-8.09 pct) * (Machine Overloaded)
> 256 54291.03 (0.00 pct) 54379.80 (0.16 pct) [Verification Run]
> 512 55586.44 (0.00 pct) 51538.93 (-7.28 pct) * (Machine Overloaded)
> 512 54190.04 (0.00 pct) 54096.16 (-0.17 pct) [Verification Run]
> 1024 56370.35 (0.00 pct) 50768.68 (-9.93 pct) * (Machine Overloaded)
> 1024 56498.36 (0.00 pct) 54661.85 (-3.25 pct) [Verification Run]
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> NPS1
>
> - 10 Runs:
>
> Test: tip cluster
> Copy: 332237.51 (0.00 pct) 338085.24 (1.76 pct)
> Scale: 215236.94 (0.00 pct) 214179.72 (-0.49 pct)
> Add: 250753.67 (0.00 pct) 251181.86 (0.17 pct)
> Triad: 259467.60 (0.00 pct) 262541.92 (1.18 pct)
>
> - 100 Runs:
>
> Test: tip cluster
> Copy: 329320.65 (0.00 pct) 336947.39 (2.31 pct)
> Scale: 218102.78 (0.00 pct) 219617.85 (0.69 pct)
> Add: 251283.30 (0.00 pct) 251918.03 (0.25 pct)
> Triad: 258044.33 (0.00 pct) 261512.99 (1.34 pct)
>
> NPS2
>
> - 10 Runs:
>
> Test: tip cluster
> Copy: 336926.24 (0.00 pct) 324310.01 (-3.74 pct)
> Scale: 220120.41 (0.00 pct) 212795.43 (-3.32 pct)
> Add: 252428.34 (0.00 pct) 254355.80 (0.76 pct)
> Triad: 274268.23 (0.00 pct) 261777.03 (-4.55 pct)
>
> - 100 Runs:
>
> Test: tip cluster
> Copy: 338126.49 (0.00 pct) 338947.03 (0.24 pct)
> Scale: 230229.59 (0.00 pct) 229991.65 (-0.10 pct)
> Add: 253964.25 (0.00 pct) 264374.57 (4.09 pct)
> Triad: 272176.19 (0.00 pct) 274587.35 (0.88 pct)
>
> NPS4
>
> - 10 Runs:
>
> Test: tip cluster
> Copy: 367144.56 (0.00 pct) 375452.26 (2.26 pct)
> Scale: 246928.04 (0.00 pct) 243651.53 (-1.32 pct)
> Add: 272096.30 (0.00 pct) 272845.33 (0.27 pct)
> Triad: 286644.55 (0.00 pct) 290925.20 (1.49 pct)
>
> - 100 Runs:
>
> Test: tip cluster
> Copy: 351980.15 (0.00 pct) 375854.72 (6.78 pct)
> Scale: 254918.41 (0.00 pct) 255904.90 (0.38 pct)
> Add: 272722.89 (0.00 pct) 274075.11 (0.49 pct)
> Triad: 283340.94 (0.00 pct) 287608.77 (1.50 pct)
>
> ~~~~~~~~~~~~~~~~~~~~
> ~ Additional notes ~
> ~~~~~~~~~~~~~~~~~~~~
>
> - schbench is known to have noticeable run-to-run variation at lower
> worker counts, so any improvements or regressions observed there can be
> safely ignored. The results are included to make sure there are
> no unexpectedly large regressions as a result of task pileup.
>
> - tbench shows slight run-to-run variation with a larger number of
> clients on both the tip and the patched kernel. This is expected, as the
> machine is overloaded at that point (the equivalent of two or more tasks
> per CPU). The "Verification Run" rows show that none of these
> regressions are persistent.
>
>>
>> [..snip..]
>>
>
> Overall, the changes look good and don't affect systems without an
> SD_CLUSTER domain, like the Zen3 system used during testing.
>
> Tested-by: K Prateek Nayak <kprateek.nayak at amd.com>
>
Thanks a lot for the testing and verification on the Zen3 system.
Regards,
Yicong