[PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
Yicong Yang
yangyicong at huawei.com
Wed Feb 16 02:00:14 PST 2022
On 2022/2/16 17:19, Song Bao Hua (Barry Song) wrote:
>
>
>> -----Original Message-----
>> From: Barry Song [mailto:21cnbao at gmail.com]
>> Sent: Wednesday, February 16, 2022 10:13 PM
>> To: Gautham R. Shenoy <gautham.shenoy at amd.com>
>> Cc: Srikar Dronamraju <srikar at linux.vnet.ibm.com>; yangyicong
>> <yangyicong at huawei.com>; Peter Zijlstra <peterz at infradead.org>; Ingo Molnar
>> <mingo at redhat.com>; Juri Lelli <juri.lelli at redhat.com>; Vincent Guittot
>> <vincent.guittot at linaro.org>; Tim Chen <tim.c.chen at linux.intel.com>; LKML
>> <linux-kernel at vger.kernel.org>; LAK <linux-arm-kernel at lists.infradead.org>;
>> Dietmar Eggemann <dietmar.eggemann at arm.com>; Steven Rostedt
>> <rostedt at goodmis.org>; Ben Segall <bsegall at google.com>; Daniel Bristot de
>> Oliveira <bristot at redhat.com>; Zengtao (B) <prime.zeng at hisilicon.com>;
>> Jonathan Cameron <jonathan.cameron at huawei.com>; ego at linux.vnet.ibm.com;
>> Linuxarm <linuxarm at huawei.com>; Song Bao Hua (Barry Song)
>> <song.bao.hua at hisilicon.com>; Guodong Xu <guodong.xu at linaro.org>
>> Subject: Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in
>> wake-up path
>>
>> On Tue, Feb 8, 2022 at 6:42 PM Barry Song <21cnbao at gmail.com> wrote:
>>>
>>> On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy at amd.com> wrote:
>>>>
>>>>
>>>> On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
>>>>
>>>>>> We already figured out that there are no idle CPUs in this cluster. So don't
>>>>>> we gain performance by picking an idle CPU/core in the neighbouring cluster?
>>>>>> If there are no idle CPUs/cores in the neighbouring cluster, then it does make
>>>>>> sense to fall back on the current cluster.
>>>>>
>>>>> What you suggested is exactly the approach we tried at the very beginning
>>>>> during debugging, but we didn't gain performance according to the benchmarks;
>>>>> we were actually losing. That is why we added these lines to stop the
>>>>> ping-ponging:
>>>>>
>>>>>         /* Don't ping-pong tasks in and out of a cluster frequently */
>>>>>         if (cpus_share_resources(target, prev_cpu))
>>>>>                 return target;
>>>>>
>>>>> If we delete this, we see a big loss in tbench while system load is medium
>>>>> and above.
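[For concreteness, a minimal sketch of the ordering under discussion: cpus_share_resources() is the helper this series adds, while scan_cluster() and scan_llc() below are hypothetical stand-ins for the select_idle_cpu() scanning steps, not functions from the patch.]

    /*
     * Sketch only: keep the task near its previous CPU when it already
     * shares cluster resources with the target, then scan the cluster
     * before the rest of the LLC.
     */
    static int select_idle_target(int target, int prev_cpu)
    {
            int cpu;

            /* Don't ping-pong tasks in and out of a cluster frequently. */
            if (cpus_share_resources(target, prev_cpu))
                    return target;

            /* Prefer an idle CPU inside the target's L2 cluster... */
            cpu = scan_cluster(target);
            if (cpu >= 0)
                    return cpu;

            /* ...before widening the search to the rest of the LLC. */
            return scan_llc(target);
    }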
>>>>
>>>> Thanks for clarifying this, Barry. Indeed, if the workload is sensitive
>>>> to data ping-ponging across L2 clusters, this heuristic makes sense. I
>>>> was thinking of workloads that require lower tail latency, in which
>>>> case exploring the larger LLC would have made more sense, assuming
>>>> that the larger LLC has an idle core/CPU.
>>>>
>>>> In the absence of any hints from the workload, like something that
>>>> Peter had previously suggested
>>>> (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass.net/),
>>>> optimizing for cache-access seems to be the right thing to do.
>>>
>>> Thanks, Gautham.
>>>
>>> Yep. Peter mentioned some hints like SCHED_BATCH and SCHED_IDLE.
>>> To me, the case we are discussing seems more complicated than
>>> applying a scheduling policy to separate tasks via SCHED_BATCH
>>> or SCHED_IDLE.
>>>
>>> For example, suppose we have a process with 20 threads. Threads 0-9
>>> might care about cache-coherence latency and want to avoid
>>> ping-ponging, while threads 10-19 might want tail latency as small
>>> as possible. So we need some way to tell the kernel, "hey, bro, please
>>> try to keep threads 0-9 still, as ping-ponging will hurt them, while trying
>>> your best to find an idle CPU in a wider range for threads 10-19". But it
>>> seems a SCHED_XXX scheduler policy hint can't tell the kernel how to organize
>>> tasks into groups, and is also incapable of telling the kernel that different
>>> groups have different needs.
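[For reference, a minimal userspace sketch of what a per-thread policy hint looks like today; note that it can only tag the calling thread, which is exactly the limitation described above.]

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            /* SCHED_BATCH (and SCHED_IDLE) require sched_priority == 0. */
            struct sched_param param = { .sched_priority = 0 };

            /* Hint that this thread favours throughput over wake-up latency. */
            if (sched_setscheduler(0, SCHED_BATCH, &param) == -1)
                    perror("sched_setscheduler");
            return 0;
    }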
>>>
>>> So it seems we want some special cgroups to organize tasks, so that we can
>>> apply different hints to each group. For example, putting threads 0-9
>>> in one cgroup and threads 10-19 in another, then:
>>> 1. apply "COMMUNICATION-SENSITIVE" to the 1st group;
>>> 2. apply "TAIL-LATENCY-SENSITIVE" to the 2nd one.
>>> I am not quite sure how to do this, or whether it could find its way into
>>> the mainline.
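[A purely hypothetical sketch of what such a per-group hint could look like from userspace; the cpu.wake_hint attribute and the hint strings below do not exist in any kernel and are made up purely to illustrate the idea.]

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical cgroup attribute (nothing like this exists upstream). */
    static void set_wake_hint(const char *cgroup, const char *hint)
    {
            char path[128];
            int fd;

            snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpu.wake_hint", cgroup);
            fd = open(path, O_WRONLY);
            if (fd >= 0) {
                    write(fd, hint, strlen(hint));
                    close(fd);
            }
    }

    int main(void)
    {
            set_wake_hint("comm-group", "communication-sensitive");
            set_wake_hint("latency-group", "tail-latency-sensitive");
            return 0;
    }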
>>>
>>> On the other hand, for this particular patch, the most controversial part is
>>> those two lines that avoid ping-ponging, and I am seeing that dropping them
>>> hurts workloads like tbench only when system load is high. So I wonder if the
>>> approach[1] from Chen Yu and Tim can resolve the problem alternatively, letting
>>> us avoid the controversial part, since their patch can also shrink the scanning
>>> range while LLC load is high.
>>>
>>> [1] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/
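[The gist of that approach, as a rough and entirely hypothetical sketch: this is not Chen Yu's actual patch, and the function name and thresholds below are made up. The idea is that when LLC utilization is high, the idle-CPU search visits fewer CPUs.]

    /*
     * Hypothetical: shrink the idle-CPU scan depth as LLC load grows.
     */
    static int sis_scan_depth(int llc_weight, int llc_util_pct)
    {
            if (llc_util_pct > 85)          /* overloaded: barely scan */
                    return 4;
            if (llc_util_pct > 50)          /* busy: scan half the LLC */
                    return llc_weight / 2;
            return llc_weight;              /* lightly loaded: scan it all */
    }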
>>
>> Yicong's testing shows that the patch from Chen Yu and Tim can resolve the
>> problem and ensure there is no performance regression for tbench while load is
>> high after we remove the code that avoids ping-ponging:
>>
>
> Sorry, it seems the format is broken. Let me re-post the data.
>
> 5.17-rc1: vanilla
> rc1 + chenyu: vanilla + Chen Yu's LLC overload patch
> rc1+chenyu+cls: vanilla + Chen Yu's patch + this patchset
> rc1+chenyu+cls-pingpong: vanilla + Chen Yu's patch + this patchset, minus the code avoiding ping-pong
> rc1+cls: vanilla + this patchset
>
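[A note on the tables: Hmean is the harmonic mean of the throughput samples, presumably taken across repeated runs as mmtests reports it. A small sketch, assuming n samples x[i]:]

    /* Harmonic mean of n throughput samples: n / sum(1/x[i]). */
    static double hmean(const double *x, int n)
    {
            double inv = 0.0;
            int i;

            for (i = 0; i < n; i++)
                    inv += 1.0 / x[i];
            return (double)n / inv;
    }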
> tbench running on numa 0&1:
> 5.17-rc1 rc1 + chenyu rc1+chenyu+cls rc1+chenyu+cls-pingpong rc1+cls
> Hmean 1 320.01 ( 0.00%) 318.03 * -0.62%* 357.15 * 11.61%* 375.43 * 17.32%* 378.44 * 18.26%*
> Hmean 2 643.85 ( 0.00%) 637.74 * -0.95%* 714.36 * 10.95%* 745.82 * 15.84%* 752.52 * 16.88%*
> Hmean 4 1287.36 ( 0.00%) 1285.20 * -0.17%* 1431.35 * 11.18%* 1481.71 * 15.10%* 1505.62 * 16.95%*
> Hmean 8 2564.60 ( 0.00%) 2551.02 * -0.53%* 2812.74 * 9.68%* 2921.51 * 13.92%* 2955.29 * 15.23%*
> Hmean 16 5195.69 ( 0.00%) 5163.39 * -0.62%* 5583.28 * 7.46%* 5726.08 * 10.21%* 5814.74 * 11.91%*
> Hmean 32 9769.16 ( 0.00%) 9815.63 * 0.48%* 10518.35 * 7.67%* 10852.89 * 11.09%* 10872.63 * 11.30%*
> Hmean 64 15952.50 ( 0.00%) 15780.41 * -1.08%* 10608.36 * -33.50%* 17503.42 * 9.72%* 17281.98 * 8.33%*
> Hmean 128 13113.77 ( 0.00%) 12000.12 * -8.49%* 13095.50 * -0.14%* 13991.90 * 6.70%* 13895.20 * 5.96%*
> Hmean 256 10997.59 ( 0.00%) 12229.20 * 11.20%* 11902.60 * 8.23%* 12214.29 * 11.06%* 11244.69 * 2.25%*
> Hmean 512 14623.60 ( 0.00%) 15863.25 * 8.48%* 14103.38 * -3.56%* 16422.56 * 12.30%* 15526.25 * 6.17%*
>
Yes, I think it will also benefit the cluster case.
But 128 threads seems like a weird point: Chen's patch on 5.17-rc1 (without this series) causes a
degradation there, while in Chen's own tbench test it doesn't cause that much when the thread count
equals 2 * the CPU number[*]:
case                    load            baseline(std%)  compare%( std%)
loopback                thread-224       1.00 (  0.17)   +2.30 (  0.10)
[*] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/
> tbench running on numa 0 only:
> 5.17-rc1 rc1 + chenyu rc1+chenyu+cls rc1+chenyu+cls-pingpong rc1+cls
> Hmean 1 324.73 ( 0.00%) 330.96 * 1.92%* 358.97 * 10.54%* 376.05 * 15.80%* 378.01 * 16.41%*
> Hmean 2 645.36 ( 0.00%) 643.13 * -0.35%* 710.78 * 10.14%* 744.34 * 15.34%* 754.63 * 16.93%*
> Hmean 4 1302.09 ( 0.00%) 1297.11 * -0.38%* 1425.22 * 9.46%* 1484.92 * 14.04%* 1507.54 * 15.78%*
> Hmean 8 2612.03 ( 0.00%) 2623.60 * 0.44%* 2843.15 * 8.85%* 2937.81 * 12.47%* 2982.57 * 14.19%*
> Hmean 16 5307.12 ( 0.00%) 5304.14 * -0.06%* 5610.46 * 5.72%* 5763.24 * 8.59%* 5886.66 * 10.92%*
> Hmean 32 9354.22 ( 0.00%) 9738.21 * 4.11%* 9360.21 * 0.06%* 9699.05 * 3.69%* 9908.13 * 5.92%*
> Hmean 64 7240.35 ( 0.00%) 7210.75 * -0.41%* 6992.70 * -3.42%* 7321.52 * 1.12%* 7278.78 * 0.53%*
> Hmean 128 6186.40 ( 0.00%) 6314.89 * 2.08%* 6166.44 * -0.32%* 6279.85 * 1.51%* 6187.85 ( 0.02%)
> Hmean 256 9231.40 ( 0.00%) 9469.26 * 2.58%* 9134.42 * -1.05%* 9322.88 * 0.99%* 9448.61 * 2.35%*
> Hmean 512 8907.13 ( 0.00%) 9130.46 * 2.51%* 9023.87 * 1.31%* 9276.19 * 4.14%* 9397.22 * 5.50%*
>
>> like rc1+cls, in some cases (256 and 512 threads on numa 0&1), it is even much better.
>>
>> Thanks
>> Barry