[PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

Abel Wu wuyun.abel at bytedance.com
Mon Jun 20 06:37:59 PDT 2022


On 6/20/22 7:20 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> Thank you for looking into this.
> 
> On 6/17/2022 10:20 PM, Tim Chen wrote:
>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote:
>>>
>>>
>>> --
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index e9f3dc6dcbf4..97a3895416ab 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>>>   	return sd;
>>>   }
>>>   
>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>> +DECLARE_PER_CPU(int, sd_share_id);
>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>   DECLARE_PER_CPU(int, sd_llc_size);
>>>   DECLARE_PER_CPU(int, sd_llc_id);
>>> -DECLARE_PER_CPU(int, sd_share_id);
>>>   DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> --
>>>
>>> The System-map of each kernel is as follows:
>>>
>>> - On "tip"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_llc_shared
>>> 0000000000020538 D sd_llc_id
>>> 000000000002053c D sd_llc_size
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_llc
>>>
>>> - On "tip + Patch 1 only" and "tip + both patches"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_cluster     <-----
>>> 0000000000020538 D sd_llc_shared
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_share_id    <-----
>>> 0000000000020544 D sd_llc_id
>>> 0000000000020548 D sd_llc_size
>>> 0000000000020550 D sd_llc
>>>
>>>
>>> - On "tip + both patches (Move declaration to top)"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_llc_shared
>>> 0000000000020538 D sd_llc_id
>>> 000000000002053c D sd_llc_size
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_llc
>>
>> Wonder if it will help to try to keep sd_llc and sd_llc_size in the same
>> cache line.  They are both used in the wakeup path.
> 
> We are still evaluating which set of variables to keep on the same
> cache line to provide the best results.
> 
> I would have expected the two kernel variants - "tip" and
> "tip + both patches (Move declaration to top)" - to give similar results,
> as their System maps for all the old variables remain the same and the
> addition of "sd_share_id" and "sd_cluster" fits in the gap after "sd_llc".
> However, we now see a regression for higher numbers of clients.
> 
> This probably hints that accessing the "sd_cluster" variable in Patch 2,
> and thereby bringing in the extra cache line, could be responsible for the
> regression we see with "tip + both patches (Move declaration to top)".
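> 
> For context, the cpus_share_resources() API added in Patch 1 presumably
> follows the existing cpus_share_cache() pattern and compares the per-cpu
> "sd_share_id" of the two CPUs. A rough, hypothetical sketch (the exact
> form in the patch may differ):
> 
> 	bool cpus_share_resources(int this_cpu, int that_cpu)
> 	{
> 		if (this_cpu == that_cpu)
> 			return true;
> 
> 		return per_cpu(sd_share_id, this_cpu) ==
> 		       per_cpu(sd_share_id, that_cpu);
> 	}
> 
> Together with the "sd_cluster" access in Patch 2, the wakeup path now
> touches per-cpu data beyond the existing sd_llc* variables.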
> 
>>
>>
>>> 0000000000020548 D sd_share_id    <-----
>>> 0000000000020550 D sd_cluster     <-----
>>>
>>>> Or change the layout a bit to see if there's any difference,
>>>> like:
>>>>
>>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>>   DEFINE_PER_CPU(int, sd_llc_size);
>>>>   DEFINE_PER_CPU(int, sd_llc_id);
>>>>   DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>> +DEFINE_PER_CPU(int, sd_share_id);
>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>>
>>>> I need to look into it further and run some tests on an SMT machine. Would you mind
>>>> sharing the kernel config as well? I'd like to compare the configs.
>>>
>>> I've attached the kernel config used to build the test kernel
>>> to this mail.
>>>
>>>> Thanks,
>>>> Yicong
>>>
>>> We are trying to debug the issue using perf and to find an optimal
>>> arrangement of the per-cpu declarations so that the relevant data
>>> used in the wakeup path lands on the same 64B cache line.
>>
>> A comparison of the perf c2c profiles between tip and the "move new declarations
>> to the top" case could be useful.  It may give some additional clues about possible
>> false sharing issues.
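>>
>> For example, recording on both kernels and comparing the reports
>> (illustrative invocation; the exact events/options can be tuned):
>>
>>   perf c2c record -a -- sleep 10
>>   perf c2c report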
> 
> Thank you for the suggestion. We are currently looking at perf counter
> data to see how the cache efficiency has changed between the two kernels.
> We suspect that the wakeup path now also needing data from the other cache
> line is resulting in higher cache misses in the cache levels closer to
> the core.
> 
> I don't think it is a false sharing problem as these per-CPU data are
> set when the system first boots up and will only be changed again during
> a CPU hotplug operation. However, it might be beneficial to look at the c2c
> profile if the perf counters don't indicate anything out of the ordinary.
> 

Could it be that some other frequently-written variables share the same
cacheline with these per-cpu data, causing false sharing? What about
making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY?
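
A minimal, untested sketch of that idea in kernel/sched/topology.c (the
matching declarations in kernel/sched/sched.h would need to become
DECLARE_PER_CPU_READ_MOSTLY as well):

-DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
-DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_llc);
+DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_size);
+DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_id);

(and similarly for the other sd_* variables). That would move them into the
read-mostly percpu section, away from any frequently written per-cpu data
that may currently share their cachelines.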



