[PATCH 4/4] arm64: dts: rockchip: Add OPP data for CPU cores on RK3588

Daniel Lezcano daniel.lezcano at linaro.org
Sun Jan 28 07:06:25 PST 2024


Hi Alexey,

On 27/01/2024 20:41, Alexey Charkov wrote:
> On Sat, Jan 27, 2024 at 12:33 AM Dragan Simic <dsimic at manjaro.org> wrote:
>>
>> On 2024-01-26 14:44, Alexey Charkov wrote:
>>> On Fri, Jan 26, 2024 at 4:56 PM Daniel Lezcano
>>> <daniel.lezcano at linaro.org> wrote:
>>>> On 26/01/2024 08:49, Dragan Simic wrote:
>>>>> On 2024-01-26 08:30, Alexey Charkov wrote:
>>>>>> On Fri, Jan 26, 2024 at 11:05 AM Dragan Simic <dsimic at manjaro.org> wrote:
>>>>>>> On 2024-01-26 07:44, Alexey Charkov wrote:
>>>>>>>> On Fri, Jan 26, 2024 at 10:32 AM Dragan Simic <dsimic at manjaro.org>
>>>>>>>> wrote:
>>>>>>>>> On 2024-01-25 10:30, Daniel Lezcano wrote:
>>>>>>>>>> On 24/01/2024 21:30, Alexey Charkov wrote:
>>>>>>>>>>> By default the CPUs on RK3588 start up in a conservative performance
>>>>>>>>>>> mode. Add frequency and voltage mappings to the device tree to enable
>>>>
>>>> [ ... ]
>>>>
>>>>>> Throttling would also lower the voltage at some point, which cools it
>>>>>> down much faster!
>>>>>
>>>>> Of course, but the key is not to cool (and slow down) the CPU cores too
>>>>> much, but just enough to stay within the available thermal envelope,
>>>>> which is where the same-voltage, lower-frequency OPPs should shine.
>>>>
>>>> That implies the resulting power is sustainable, which I doubt is the
>>>> case.
>>>>
>>>> It is the voltage scaling, not the frequency scaling, that makes the
>>>> cooling effect efficient.
>>>>
>>>> For example:
>>>>          opp5 = opp(2GHz, 1V) => 2 BogoWatt
>>>>          opp4 = opp(1.9GHz, 1V) => 1.9 BogoWatt
>>>>          opp3 = opp(1.8GHz, 0.9V) => 1.458 BogoWatt
>>>>          [ other states but we focus on these 3 ]
>>>>
>>>> opp5->opp4 => -5% compute capacity, -5% power, ratio=1
>>>> opp4->opp3 => -5% compute capacity, -23.1% power, ratio=21.6
>>>>
>>>> opp5->opp3 => -10% compute capacity, -27.1% power, ratio=36.9
>>>>
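
The arithmetic behind the example above can be sketched as follows. This is only a rough model, assuming dynamic power scales with f * V^2 and compute capacity scales linearly with frequency; the OPP values are the made-up ones from the example, not real RK3588 data, and the helper names are hypothetical. Recomputing from the model may differ from the rounded figures quoted above in the last digit.

```python
# Simplified model (assumption): dynamic power scales with f * V^2.
# Units are arbitrary "BogoWatt", matching the made-up OPPs in the example.
OPPS = {
    "opp5": (2.0, 1.0),   # (frequency in GHz, voltage in V)
    "opp4": (1.9, 1.0),
    "opp3": (1.8, 0.9),
}


def bogowatt(freq_ghz, volt):
    """Relative dynamic power for one OPP under the f * V^2 model."""
    return freq_ghz * volt ** 2


def transition(src, dst):
    """Return (capacity_loss, power_saving) as fractions for src -> dst.

    Compute capacity is assumed proportional to frequency.
    """
    f_src, v_src = OPPS[src]
    f_dst, v_dst = OPPS[dst]
    capacity_loss = 1.0 - f_dst / f_src
    power_saving = 1.0 - bogowatt(f_dst, v_dst) / bogowatt(f_src, v_src)
    return capacity_loss, power_saving


for src, dst in [("opp5", "opp4"), ("opp4", "opp3"), ("opp5", "opp3")]:
    loss, saving = transition(src, dst)
    print(f"{src}->{dst}: -{loss:.1%} compute capacity, -{saving:.1%} power")
```

The point of the sketch is only that a step which also lowers the voltage saves far more power per unit of lost capacity than a frequency-only step.
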
>>>> In burst operation (no thermal throttling), opp4 is pointless; we agree
>>>> on that.
>>>>
>>>> IMO the following will happen: in burst operation with thermal
>>>> throttling we hit the trip point and then the step-wise governor
>>>> reduces opp5 -> opp4. We get a slight power reduction but the
>>>> temperature does not decrease, so at the next iteration it is
>>>> throttled to opp3. In the end we have opp4 <-> opp3 back and forth
>>>> instead of opp5 <-> opp3.
>>>>
>>>> It is probable we end up with an equivalent frequency average (or
>>>> compute capacity avg).
>>>>
>>>> opp4 <-> opp3 (longer duration in states, fewer transitions)
>>>> opp5 <-> opp3 (shorter duration in states, more transitions)
>>>>
>>>> Some platforms had their higher OPPs with the same voltage and they
>>>> failed to cool down the CPU in the long run.
>>>>
>>>> Anyway, there is only one way to check it out :)
>>>>
>>>> Alexey, is it possible to compare the compute duration for 'dhrystone'
>>>> with and without these same-voltage OPPs (with a cool-down period
>>>> between the tests so that each one starts from the same thermal
>>>> condition)?
>>>
>>> Sure, let me try that - would be interesting to see the results. In my
>>> previous tinkering there were cases when the system stayed at 2.35GHz
>>> for all big cores for non-trivial time (using the step-wise thermal
>>> governor), and that's an example of "same voltage, lower frequency".
>>> Other times though it throttled one cluster down to 1.8GHz and kept
>>> the other at 2.4GHz, and was also stationary at those parameters for
>>> extended time. This probably indicates that both of those states use
>>> sustainable power in my cooling setup.
>>
>> IMHO, there are simply too many factors at play, including different
>> possible cooling setups, so providing additional CPU throttling
>> granularity can only be helpful.  Of course, testing and recording
>> data is the way to move forward, but I think we should use a few
>> different tests.
> 
> Soooo, benchmarking these turned out a bit trickier than I had hoped
> for. Apparently, dhrystone uses an unsigned int rather than an
> unsigned long for the loops count (or something of that sort), which
> means that I can't get it to run enough loops to heat up my chip from
> a stable idle state to the throttling state (due to counter
> wraparound). So I ended up with a couple of crutches, namely:
>   - run dhrystone continuously on 6 out of 8 cores to make the chip
> warm enough (`taskset -c 0-5 ./dhrystone -t 6 -r 6000` - note that on
> my machine cores 6-7 are usually the first ones to get throttled, due
> to whatever thermal peculiarities)
>   - wait for the temperature to stabilize (which happens at 79.5C)
>   - then run timed dhrystone on the remaining 2 out of 8 cores (big
> ones) to see how throttling with different OPP tables affects overall
> performance.

Thanks for taking the time to test.

> In the end, here's what I got with the 'original' OPP table (including
> "same voltage - different frequencies" states):
> alchark at rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
> duration: 0 seconds
> number of threads: 2
> number of loops: 4000000000000000
> delay between starting threads: 0 seconds
> 
> Dhrystone(1.1) time for 1233977344 passes = 29.7
> This machine benchmarks at 41481539 dhrystones/second
>                             23609 DMIPS
> Dhrystone(1.1) time for 1233977344 passes = 29.8
> This machine benchmarks at 41476618 dhrystones/second
>                             23606 DMIPS
> 
> Total dhrystone run time: 30.864492 seconds.
> 
> And here's what I got with the 'reduced' OPP table (keeping only the
> highest frequency state for each voltage):
> alchark at rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
> duration: 0 seconds
> number of threads: 2
> number of loops: 4000000000000000
> delay between starting threads: 0 seconds
> 
> Dhrystone(1.1) time for 1233977344 passes = 30.9
> This machine benchmarks at 39968549 dhrystones/second
>                            22748 DMIPS
> Dhrystone(1.1) time for 1233977344 passes = 31.0
> This machine benchmarks at 39817431 dhrystones/second
>                            22662 DMIPS
> 
> Total dhrystone run time: 31.995136 seconds.
> 
> Bottom line: removing the lower-frequency OPPs led to a 3.8% drop in
> performance in this setup. This is probably far from a reliable
> estimate, but I guess it indeed indicates that having lower-frequency
> states might be beneficial in some load scenarios.
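
As a rough cross-check of the figures quoted above, the relative difference can be recomputed from the reported DMIPS and total run times. This is only a sketch: the numbers are copied from the two dhrystone outputs, the function and variable names are hypothetical, and a single pair of runs says nothing about repeatability.

```python
def relative_drop(original, reduced):
    """Fractional decrease going from `original` to `reduced`."""
    return (original - reduced) / original


# Per-thread DMIPS as reported in the two runs quoted above.
dmips_original = [23609, 23606]   # full table, incl. same-voltage OPPs
dmips_reduced = [22748, 22662]    # only the top frequency per voltage

# Total run times in seconds, also from the quoted output.
runtime_original = 30.864492
runtime_reduced = 31.995136

drops = [relative_drop(o, r) for o, r in zip(dmips_original, dmips_reduced)]
mean_drop = sum(drops) / len(drops)
runtime_increase = (runtime_reduced - runtime_original) / runtime_original

print(f"per-thread DMIPS drop: {', '.join(f'{d:.2%}' for d in drops)}")
print(f"mean DMIPS drop:       {mean_drop:.2%}")
print(f"run time increase:     {runtime_increase:.2%}")
```
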

What is the duration between these two tests?

I would be curious whether it is repeatable when the order is inverted
(reduced OPP table first, then the original OPP table).

BTW: I used -l 10000 for a roughly 30-second workload on the rk3399;
maybe -l 20000 will be OK for the rk3588.

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog




More information about the linux-arm-kernel mailing list