[PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path

Mon Feb 7 21:42:29 PST 2022

On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy at amd.com> wrote:
>
>
> On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
>
> > > We already figured out that there are no idle CPUs in this cluster. So don't
> > > we gain performance by picking an idle CPU/core in the neighbouring cluster?
> > > If there are no idle CPU/core in the neighbouring cluster, then it does make
> > > sense to fallback on the current cluster.
> >
> > What you suggested is exactly the approach we tried at the very beginning
> > during debugging. But we didn't gain performance according to the benchmark;
> > we were actually losing. That is why we added this line to stop ping-pong:
> >          /* Don't ping-pong tasks in and out cluster frequently */
> >          if (cpus_share_resources(target, prev_cpu))
> >             return target;
> >
> > If we delete this, we are seeing a big loss of tbench while system
> > load is medium and above.
>
> Thanks for clarifying this Barry. Indeed, if the workload is sensitive
> to data ping-ponging across L2 clusters, this heuristic makes sense. I
> was thinking of workloads that require lower tail latency, in which
> case exploring the larger LLC would have made more sense, assuming
> that the larger LLC has an idle core/CPU.
>
> In the absence of any hints from the workload, like something that
> Peter had previously suggested
> (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass.net/),
> optimizing for cache-access seems to be the right thing to do.

Thanks, Gautham.

Yep. Peter mentioned hints such as SCHED_BATCH and SCHED_IDLE.
To me, the case we are discussing seems more complicated than
applying a scheduling policy such as SCHED_BATCH or SCHED_IDLE to
individual tasks.

For example, suppose we have a process with 20 threads. thread0-9
might care about cache-coherence latency and want to avoid
ping-ponging, while thread10-19 might want tail latency to be as
small as possible. So we need some way to tell the kernel: "hey, bro,
please keep thread0-9 where they are, as ping-ponging will hurt them,
while trying your best to find an idle CPU in a wider range for
thread10-19". But a SCHED_XXX scheduling policy hint can't tell the
kernel how to organize tasks into groups, nor that different groups
have different needs.

So it seems we want some special cgroups to organize tasks, so that we
can apply a different hint to each group. For example, put thread0-9
in one cgroup and thread10-19 in another, then:
1. apply "COMMUNICATION-SENSITIVE" to the 1st group
2. apply "TAIL-LATENCY-SENSITIVE" to the 2nd one.
I am not quite sure how to do this, or whether it can find its way into
the mainline.
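
To make the idea a bit more concrete, here is a rough sketch in plain
user-space C of how a per-group hint could change the fallback once the
target's cluster has no idle CPU. This is not kernel code and not part
of this patch: the enum, pick_cpu(), the 8-CPU LLC and the 4-CPU
cluster size are all made up for illustration only.

```c
#include <stdbool.h>

/* Hypothetical per-group hints; no such interface exists in mainline. */
enum wake_hint {
	HINT_NONE,
	HINT_COMMUNICATION_SENSITIVE,
	HINT_TAIL_LATENCY_SENSITIVE,
};

/* Assumed toy topology: one LLC of 8 CPUs, split into clusters of 4. */
#define NR_CPUS		8
#define CLUSTER_SIZE	4

/* Return the first idle CPU in [first, last], or -1 if there is none. */
static int scan_idle(const bool *idle, int first, int last)
{
	for (int cpu = first; cpu <= last; cpu++)
		if (idle[cpu])
			return cpu;
	return -1;
}

/* Pick a CPU for a waking task whose group carries the given hint. */
int pick_cpu(enum wake_hint hint, const bool idle[NR_CPUS], int target)
{
	int first = (target / CLUSTER_SIZE) * CLUSTER_SIZE;
	int cpu;

	if (idle[target])
		return target;

	/* Everyone scans the target's own cluster first. */
	cpu = scan_idle(idle, first, first + CLUSTER_SIZE - 1);
	if (cpu >= 0)
		return cpu;

	/* Only a tail-latency-sensitive group widens the scan to the LLC. */
	if (hint == HINT_TAIL_LATENCY_SENSITIVE) {
		cpu = scan_idle(idle, 0, NR_CPUS - 1);
		if (cpu >= 0)
			return cpu;
	}

	/* Otherwise stay put, so the task does not ping-pong out of the cluster. */
	return target;
}
```

With only a CPU in the other cluster idle, a communication-sensitive
group stays on the target while a tail-latency-sensitive group moves
to the idle CPU; when the local cluster has an idle CPU, both pick it.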

On the other hand, for this particular patch, the most controversial
part is those two lines to avoid ping-ponging, and I am seeing that
dropping them hurts workloads like tbench only when system load is
high. So I wonder if the approach[1] from Chen Yu and Tim can resolve
the problem in a different way, since their patch can also shrink the
scanning range while LLC load is high. Then we could avoid the
controversial part.
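
Just to illustrate what "shrink the scanning range while load is high"
could mean, below is a toy user-space C sketch. It is NOT the formula
from [1]; the linear scaling, the percentage input and the name
scan_budget() are my own assumptions for illustration.

```c
/*
 * Hypothetical helper: how many CPUs of the LLC to scan for an idle one.
 * Assumption: the budget shrinks linearly as LLC utilization (in percent)
 * rises, and never drops below one CPU.
 */
int scan_budget(int llc_size, int util_pct)
{
	int nr;

	/* Clamp the utilization to a sane 0..100 range. */
	if (util_pct < 0)
		util_pct = 0;
	if (util_pct > 100)
		util_pct = 100;

	nr = llc_size * (100 - util_pct) / 100;

	return nr > 0 ? nr : 1;
}
```

With something like this, a heavily loaded LLC would barely be scanned
beyond the target, which might give a similar effect to the two
ping-pong lines without hard-coding them. Whether it really does would
need to be measured with tbench.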

[1] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/

>
>
> >
> > Thanks
> > Barry
>
> --
> Thanks and Regards
> gautham.

Thanks
Barry