[PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

Yicong Yang yangyicong at huawei.com
Thu Jun 16 00:55:15 PDT 2022


Hi Prateek,

It seems my previous reply was in the wrong format and the server rejected it, so I'm reposting.

Thanks a lot for the test!

On 2022/6/15 22:19, K Prateek Nayak wrote:
> Hello Yicong,
> 
> I replied to the v3 of the series by mistake!
> (https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@amd.com/)
> But rest assured all the analysis discussed there was done with
> the v4 patch series. I'll add the same observations below so we
> can continue discussion on v4.
> 
> We are observing a serious regression with tbench with this patch
> series applied. The issue doesn't seem to be related to the actual
> functionality of the patch, but to how the patch changes the per-CPU
> variable layout.
> 
> Discussed below are the results from running tbench on a dual
> socket Zen3 (2 x 64C/128T) configured in different NPS modes.
> 
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 sockets.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 sockets.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
> 
> Benchmark Results:
> 
> Kernel versions:
> - tip:      5.19.0-rc2 tip sched/core
> - cluster:  5.19.0-rc2 tip sched/core + both the patches of the series
> 
> When we started testing, the tip was at:
> commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"
> 
> * - Data points of concern
> 
> ~~~~~~
> tbench
> ~~~~~~
> 
> NPS1
> 
> Clients:       tip                     cluster
>     1    444.41 (0.00 pct)       439.27 (-1.15 pct)
>     2    879.23 (0.00 pct)       831.49 (-5.42 pct)     *
>     4    1648.83 (0.00 pct)      1608.07 (-2.47 pct)
>     8    3263.81 (0.00 pct)      3086.81 (-5.42 pct)    *
>    16    6011.19 (0.00 pct)      5360.28 (-10.82 pct)   *
>    32    12058.31 (0.00 pct)     8769.08 (-27.27 pct)   *
>    64    21258.21 (0.00 pct)     19021.09 (-10.52 pct)  *
>   128    30795.27 (0.00 pct)     30861.34 (0.21 pct)
>   256    25138.43 (0.00 pct)     24711.90 (-1.69 pct)
>   512    51287.93 (0.00 pct)     51855.55 (1.10 pct)
>  1024    53176.97 (0.00 pct)     52554.55 (-1.17 pct)
> 
> NPS2
> 
> Clients:       tip                     cluster
>     1    445.45 (0.00 pct)       441.75 (-0.83 pct)
>     2    869.24 (0.00 pct)       845.61 (-2.71 pct)
>     4    1644.28 (0.00 pct)      1586.49 (-3.51 pct)
>     8    3120.83 (0.00 pct)      2967.01 (-4.92 pct)    *
>    16    5972.29 (0.00 pct)      5208.58 (-12.78 pct)   *
>    32    11776.38 (0.00 pct)     10229.53 (-13.13 pct)  *
>    64    20933.15 (0.00 pct)     17033.45 (-18.62 pct)  *
>   128    32195.00 (0.00 pct)     29507.85 (-8.34 pct)   *
>   256    24641.52 (0.00 pct)     27225.00 (10.48 pct)
>   512    50806.96 (0.00 pct)     51377.50 (1.12 pct)
>  1024    51993.96 (0.00 pct)     50773.35 (-2.34 pct)
> 
> NPS4
> 
> Clients:      tip                   cluster
>     1    442.10 (0.00 pct)       435.06 (-1.59 pct)
>     2    870.94 (0.00 pct)       858.64 (-1.41 pct)
>     4    1615.30 (0.00 pct)      1607.27 (-0.49 pct)
>     8    3195.95 (0.00 pct)      3020.63 (-5.48 pct)    *
>    16    5937.41 (0.00 pct)      5719.87 (-3.66 pct)
>    32    11800.41 (0.00 pct)     11229.65 (-4.83 pct)   *
>    64    20844.71 (0.00 pct)     20432.79 (-1.97 pct)
>   128    31003.62 (0.00 pct)     29441.20 (-5.03 pct)   *
>   256    27476.37 (0.00 pct)     25857.30 (-5.89 pct)   * [Known to have run-to-run variance]
>   512    52276.72 (0.00 pct)     51659.16 (-1.18 pct)
>  1024    51372.10 (0.00 pct)     51026.87 (-0.67 pct)
> 
> Note: tbench results for 256 clients are known to have
> run-to-run variation on the test machine. Any regression
> seen at that data point can be safely ignored.
> 
> The behavior is consistent for both tip and patched kernel
> across multiple runs of tbench.
> 
> ~~~~~~~~~~~~~~~~~~~~
> Analysis done so far
> ~~~~~~~~~~~~~~~~~~~~
> 
> To root cause this issue more quickly, we have focused on the 8 to 64
> client data points with the machine running in NPS1 mode.
> 
> - Even with the HW prefetcher disabled, the behavior remains consistent,
>   signifying that the HW prefetcher is not the cause of the problem.
> 
> - Bisecting:
> 
> When we ran the tests with only Patch 1 of the series applied, the
> regression was still visible and the numbers were worse than with the
> full series.
> 
> Clients:       tip                     cluster              Patch 1 Only
>     8    3263.81 (0.00 pct)      3086.81 (-5.42 pct)     3018.63 (-7.51 pct)
>    16    6011.19 (0.00 pct)      5360.28 (-10.82 pct)    4869.26 (-18.99 pct)
>    32    12058.31 (0.00 pct)     8769.08 (-27.27 pct)    8159.60 (-32.33 pct)
>    64    21258.21 (0.00 pct)     19021.09 (-10.52 pct)   13161.92 (-38.08 pct)
> 
> We further bisected the hunks and narrowed down the cause to the
> per-CPU variable declarations.
> 
> 
> On 6/9/2022 5:36 PM, Yicong Yang wrote:
>>
>> [..snip..]
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 01259611beb9..b9bcfcf8d14d 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>  DECLARE_PER_CPU(int, sd_llc_size);
>>  DECLARE_PER_CPU(int, sd_llc_id);
>> +DECLARE_PER_CPU(int, sd_share_id);
>>  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> 
> The main reason for the regression seems to be the above declarations.
> The regression seems to go away if we do one of the following:
> 
> - Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY()
>   instead of DECLARE_PER_CPU() and change the corresponding definitions
>   below to DEFINE_PER_CPU_READ_MOSTLY() (a sketch of this change follows
>   the list below).
> 
>   Clients:       tip                     Patch 1           Patch 1 (READ_MOSTLY)
>     8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3237.33 (-0.56 pct)
>    16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    5914.53 (-2.92 pct)
>    32      11156.56 (0.00 pct)     8159.60 (-26.86 pct)    11536.05 (3.40 pct)
>    64      21019.97 (0.00 pct)     13161.92 (-37.38 pct)   21162.33 (0.67 pct)
> 
> - Convert sd_share_id and sd_cluster to static arrays.
>   
>   Clients:       tip                    Patch 1            Patch 1 (Static Array)
>     8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3203.27 (-1.61 pct)
>    16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    6198.35 (1.73 pct)
>    32      11156.56 (0.00 pct)     8159.60 (-26.86 pct)    11385.76 (2.05 pct)
>    64      21019.97 (0.00 pct)     13161.92 (-37.38 pct)   21919.80 (4.28 pct)
> 
> - Move the declarations of sd_share_id and sd_cluster to the top
> 
>   Clients:       tip                    Patch 1            Patch 1 (Declaration on Top)
>     8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3072.30 (-5.63 pct)
>    16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    5586.59 (-8.30 pct)
>    32      11156.56 (0.00 pct)     8159.60 (-26.86 pct)    11184.17 (0.24 pct)
>    64      21019.97 (0.00 pct)     13161.92 (-37.38 pct)   20289.70 (-3.47 pct)
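> 
> For reference, the READ_MOSTLY variant we tested is a sketch along the
> following lines on top of Patch 1 (the DEFINE_PER_CPU() site in
> kernel/sched/topology.c is changed the same way):
> 
> -DECLARE_PER_CPU(int, sd_share_id);
> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> +DECLARE_PER_CPU_READ_MOSTLY(int, sd_share_id);
> +DECLARE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_cluster);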
> 
> Unfortunately, none of these are complete solutions. For example, using
> DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression
> but doesn't eliminate it entirely:
> 
>    Clients:     tip                     cluster           cluster (READ_MOSTLY)  
>     1      444.41 (0.00 pct)       439.27 (-1.15 pct)      435.95 (-1.90 pct)
>     2      879.23 (0.00 pct)       831.49 (-5.42 pct)      842.09 (-4.22 pct)
>     4      1648.83 (0.00 pct)      1608.07 (-2.47 pct)     1598.77 (-3.03 pct)
>     8      3263.81 (0.00 pct)      3086.81 (-5.42 pct)     3090.86 (-5.29 pct)    *
>    16      6011.19 (0.00 pct)      5360.28 (-10.82 pct)    5360.28 (-10.82 pct)   *
>    32      12058.31 (0.00 pct)     8769.08 (-27.27 pct)    11083.66 (-8.08 pct)   *
>    64      21258.21 (0.00 pct)     19021.09 (-10.52 pct)   20984.30 (-1.28 pct)
>   128      30795.27 (0.00 pct)     30861.34 (0.21 pct)     30735.20 (-0.19 pct)
>   256      25138.43 (0.00 pct)     24711.90 (-1.69 pct)    24021.21 (-4.44 pct)
>   512      51287.93 (0.00 pct)     51855.55 (1.10 pct)     51672.73 (0.75 pct)
>  1024      53176.97 (0.00 pct)     52554.55 (-1.17 pct)    52620.02 (-1.04 pct)
> 
> We are still trying to root cause the underlying issue that
> brought about such a drastic regression in tbench performance.
> 
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 05b6c2ad90b9..0595827d481d 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>  DEFINE_PER_CPU(int, sd_llc_size);
>>  DEFINE_PER_CPU(int, sd_llc_id);
>> +DEFINE_PER_CPU(int, sd_share_id);
>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>
>>  [..snip..]
>>
> 
> We would like some time to investigate this issue and root cause
> this regression.

One significant difference I can see for now is that Kunpeng 920 is a non-SMT machine while Zen3
is an SMT machine, so on Kunpeng 920 the select_idle_sibling() path won't use sd_llc_shared. Since
sd_share_id and sd_cluster are inserted between sd_llc_id and sd_llc_shared, which are both used in
that path on your test machine, I guess the change of layout may affect something like the cache
behaviour.
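
A minimal userspace sketch of what I mean, modelling the per-cpu variables
as adjacent struct fields since they are laid out roughly in definition
order (the kernel's actual per-cpu section layout and alignment may of
course differ):

#include <stdio.h>
#include <stddef.h>

/* Mimics the definition order in kernel/sched/topology.c: the two new
 * entries sit between sd_llc_id and sd_llc_shared and push sd_llc_shared
 * further away from the fields read first in the wakeup path. */
struct pcpu_sketch {
	void *sd_llc;        /* read in the wakeup fast path */
	int   sd_llc_size;
	int   sd_llc_id;     /* read in cpus_share_cache() */
	int   sd_share_id;   /* newly inserted */
	void *sd_cluster;    /* newly inserted */
	void *sd_llc_shared; /* read in select_idle_sibling() on SMT machines */
};

int main(void)
{
	printf("sd_llc_id     at offset %zu\n",
	       offsetof(struct pcpu_sketch, sd_llc_id));
	printf("sd_llc_shared at offset %zu\n",
	       offsetof(struct pcpu_sketch, sd_llc_shared));
	return 0;
}

If the insertion moves sd_llc_shared onto a different cache line relative
to sd_llc/sd_llc_id, the SMT wakeup path would touch one more line per
wakeup, which could explain a layout-sensitive regression.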

Could you also test with SMT disabled? Or change the layout a bit, like below, to see if there's
any difference:

 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(int, sd_share_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);

I need to look into it further and run some tests on an SMT machine. Would you mind sharing
your kernel config as well? I'd like to compare the two configs.
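
If you have both configs at hand, the diffconfig script in the kernel tree
makes the comparison quick (the paths below are just placeholders):

 $ scripts/diffconfig /path/to/config-tip /path/to/config-cluster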

Thanks,
Yicong


