[PATCH v2] perf/hx_arm_ni: Support uncore ARM NI-700 PMU

Yang Jialong 杨佳龙 jialong.yang at shingroup.cn
Wed Jan 31 22:35:42 PST 2024



在 2024/2/1 11:00, Yang Jialong 杨佳龙 写道:
> 
> 
> 在 2024/1/31 18:36, Krzysztof Kozlowski 写道:
>> On 31/01/2024 11:07, Yang Jialong 杨佳龙 wrote:
>>>
>>>
>>> 在 2024/1/31 17:38, Krzysztof Kozlowski 写道:
>>>> On 31/01/2024 10:07, Yang Jialong 杨佳龙 wrote:
>>>>>
>>>>>
>>>>> 在 2024/1/31 15:59, Krzysztof Kozlowski 写道:
>>>>>> On 31/01/2024 08:08, JiaLong.Yang wrote:
>>>>>>> This code is based on uncore PMUs arm_smmuv3_pmu and arm-cmn.
>>>>>>> One ni-700 can have many clock domains. Each of them has only one 
>>>>>>> PMU.
>>>>>>> Here one PMU corresponds to one 'struct ni_pmu' instance.
>>>>>>> PMU name will be ni_pmu_N_M, which N means different NI-700s and 
>>>>>>> M means
>>>>>>> different PMU in one NI-700. If only one NI-700 found in NI-700, 
>>>>>>> name will
>>>>>>> be ni_pmu_N.
>>>>>>> Node interface event name will be xxni_N_eventname, such as
>>>>>>> asni_0_rdreq_any. There are many kinds of type of nodes in one clock
>>>>>>> domain. Also means that there are many kinds of that in one PMU. 
>>>>>>> So we
>>>>>>> distinguish them by xxni string. Besides, maybe there are many nodes
>>>>>>> have same type. So we have number N in event name.
>>>>>>> By ni_pmu_0_0/asni_0_rdreq_any/, we can pinpoint accurate bus 
>>>>>>> traffic.
>>>>>>> Example1: perf stat -a -e 
>>>>>>> ni_pmu_0_0/asni_0_rdreq_any/,ni_pmu_0_0/cycles/
>>>>>>> EXample2: perf stat -a -e ni_pmu_0_0/asni,id=0,event=0x0/
>>>>>>>
>>>>>>> Signed-off-by: JiaLong.Yang <jialong.yang at shingroup.cn>
>>>>>>> ---
>>>>>>> v1 --> v2:
>>>>>>> 1. Submit MAINTANER Documentation/ files seperately.
>>>>>>
>>>>>> SEPARATE PATCHES, not patchsets. You have now checkpatch warnings
>>>>>> because of this...
>>>>>
>>>>> ...OK. But the MAINTANER file changing should be given in which one
>>>>> patches.
>>>>> I will submit patch v3 after talking and your permission.
>>>>>
>>>>>>
>>>>>>> 2. Delete some useless info printing.
>>>>>>> 3. Change print from pr_xxx to dev_xxx.
>>>>>>> 4. Fix more than 75 length log info.
>>>>>>> 5. Fix dts attribute pccs-id.
>>>>>>> 6. Fix generic name according to DT specification.
>>>>>>> 7. Some indentation.
>>>>>>> 8. Del of_match_ptr macro.
>>>>>>>
>>>>>>>     drivers/perf/Kconfig     |   11 +
>>>>>>>     drivers/perf/Makefile    |    1 +
>>>>>>>     drivers/perf/hx_arm_ni.c | 1284 
>>>>>>> ++++++++++++++++++++++++++++++++++++++
>>>>>>>     3 files changed, 1296 insertions(+)
>>>>>>>     create mode 100644 drivers/perf/hx_arm_ni.c
>>>>>>>
>>>>>>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>>>>>>> index ec6e0d9194a1..95ef8b13730f 100644
>>>>>>> --- a/drivers/perf/Kconfig
>>>>>>> +++ b/drivers/perf/Kconfig
>>>>>>> @@ -241,4 +241,15 @@ config CXL_PMU
>>>>>>>           If unsure say 'm'.
>>>>>>> +config HX_ARM_NI_PMU
>>>>>>> +       tristate "HX ARM NI-700 PMU"
>>>>>>> +       depends on PPC_HX_C2000 && 64BIT
>>>>>>
>>>>>> 1. There is no PPC_HX_C2000.
>>>>>
>>>>> I have been used to using this macro. However this macro is not 
>>>>> existed
>>>>> in mainline.
>>>>> I will replace it with ARM64. And del involved C code if OK.
>>>>>
>>>>> 64bit:
>>>>> __ffs(unsigned long) and __fls(unsigned long) will be wrong in 
>>>>> 32bit. I
>>>>> pass a u64 argument.
>>>>
>>>> One thing is where the code is supposed to run, second thing is compile
>>>> testing.
>>>>
>>>
>>> Now run on my company product, a 64bit PowerPC...
>>> But I think it's general for 64bit systems.
>>>
>>>> Why do you use __ffs, not __ffs64 which takes u64 if you really want
>>>> only 64bit argument? unsigned long != u64, so your code is not
>>>> architecture independent. You claim you wrote it on purpose as
>>>> non-architecture-independent, but then I claim it's a bug. We are
>>>> supposed to write code which is portable, as much as possible, assuming
>>>> it does not affect readability.
>>>>
>>>
>>> I write code in v5.18, there are __ffs64() and fls64(). Asymmetric.
>>
>> Sorry, that's a no go.
>>
>> That's some very, very old kernel. Do not develop on old kernels, but on
>> mainline. I also suspect that by basing your work on old kernel, you
>> duplicate a lot of issues already fixed.
>>
>>> There are some difference in return val between __ffs() and ffs64().
>>> __ffs(0) and ffs64(0) will give different value.
>>
>> __ffs64 calls __ffs, so why would results be different?
>>
>> Anyway, that's not really excuse.
>>
> 
> OK. Follow mainline.
> 
>>
>>>
>>> And I'm sure code run in 64bit. So I choose to use __ffs and __fls.
>>>
>>> Maybe it could be compatbile with 32bit. But I don't have a environment
>>> to test this.
>>>>
>>>>> struct ni_hw_perf_event will be big than limit.
>>>>> BUILD_BUG_ON(sizeof(struct ni_hw_perf_event) > offsetof(struct
>>>>> hw_perf_event, target));
>>>>
>>>> And why do you need to use any of such code? Please open one of 
>>>> hundreds
>>>> of other drivers which work correctly on 32 and 64-bit platforms.
>>>>
>>>
>>> Code for 64bit.
>>> This code is to avoid struct ni_hw_perf_event is too big than struct
>>> hw_perf_event::target.
>>
>> 1. Why would that matter? target is task_struct. It's size does not
>> matter. Maybe its offset matters, but not size.
>>
> 
> Offset.
> 
>> 2. So you claim that on 32-bit system the structure will be bigger than
>> on 64-bit system?
> 
> The structure will exceed the offset in 32bit. Maybe because the latter 
> changed more.
> OK. Dont care please.
> 
>>
>>> I learn it from arm-cmn.c.
>>
>> Are you copying patterns because they are good patterns or just because
>> you decided to copy?
> 
> Maybe this way is not very good for event framework.
> OK. Not an official way.
> 

First, the struct hw_perf_event is originally used to record some 
general information about its event. Example, struct 
hw_perf_event::state is used to record status by driver developer.

Second, But there are some field I don't need to use. And instread I 
want to use another name, just like en_bit_mask or is_cc, depending on hw.

Finally, considering the lifetime keeping of struct ni_hw_perf_event, 
this way is very well. I don't need to keep an eye on it. And I just 
need to promise the size. So I choose BUILD_BUG_ON();

What do you think?

Dont care about 32, 64 or offset.

>>
>>> ni_hw_perf_event will replace hw_perf_event.
>>> I will put some useful information in it with less space and good field
>>> names.
>>> But I can't exceed a limit.
>>>
>>>>>
>>>>>> 2. Nothing justified dependency on 64bit. Drop or explain. Your 
>>>>>> previous
>>>>>> message did not provide real rationale.
>>>>>
>>>>> If ARM64, then drop.
>>>>
>>>> ...
>>>>
>>>> ...
>>>>
>>>>>>> +    /* Step2: Traverse all clock domains. */
>>>>>>> +    for (cd_idx = 0; cd_idx < ni->cd_num; cd_idx++) {
>>>>>>> +        cd = cd_arrays[cd_idx];
>>>>>>> +
>>>>>>> +        num = ni_child_number(cd);
>>>>>>> +        dev_dbg(dev, "The %dth clock domain has %d child 
>>>>>>> nodes:", cd_idx, num);
>>>>>>> +
>>>>>>> +        /* Omit pmu node */
>>>>>>> +        ni_pmu = devm_kzalloc(dev, struct_size(ni_pmu, 
>>>>>>> ev_src_nodes, num - 1),
>>>>>>> +                      GFP_KERNEL);
>>>>>>> +        ni_pmu->ev_src_num = num - 1;
>>>>>>> +
>>>>>>> +        if (!ni_pmu)
>>>>>>> +            return -ENOMEM;
>>>>>>> +
>>>>>>> +        num_idx = 0;
>>>>>>> +        for (nd_idx = 0; nd_idx < num; nd_idx++) {
>>>>>>> +            nd = ni_child_pointer(pbase, cd, nd_idx);
>>>>>>> +
>>>>>>> +            node.base = nd;
>>>>>>> +            node.node_type = ni_node_node_type(nd);
>>>>>>> +
>>>>>>> +            if (unlikely(ni_node_type(nd) == NI_PMU))
>>>>>>> +                ni_pmu->pmu_node = node;
>>>>>>> +            else
>>>>>>> +                ni_pmu->ev_src_nodes[num_idx++] = node;
>>>>>>> +            dev_dbg(dev, "  name: %s   id: %d", 
>>>>>>> ni_node_name[node.type], node.id);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        ni_pmu->dev = dev;
>>>>>>> +        ni_pmu->ni = ni;
>>>>>>> +        ni->ni_pmus[cd_idx] = ni_pmu;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    devm_kfree(dev, cd_arrays);
>>>>>>
>>>>>> Why? If it is not device-lifetime then allocate with usual way.
>>>>>>
>>>>>
>>>>> No device-lifetime.
>>>>> Will allocate in stack.
>>>>
>>>> I was thinking about kzalloc. But if array is small, stack could be 
>>>> as well.
>>>>
>>>
>>> If I have to return before devm_kfree because of wrong, I will have 
>>> to use:
>>>
>>> goto out;
>>>
>>> out:
>>> kfree();
>>>
>>> But if I use devm_kzalloc, I will not be worried about that. Even if no
>>
>> devm* is not for that purpose. devm is for device-managed allocations.
>> Device does not manage your allocation.
>>
>>> device-lifetime.
>>> Isn't this a good way?
>>
>> Then you want cleanup.h and use proper __free().
> 
> Good NEW API. It does what I want.
> Learned more. Thanks.
> 
>>
>> Best regards,
>> Krzysztof
>>
>>



More information about the linux-arm-kernel mailing list