NVMe APST high latency power states being skipped
Andy Lutomirski
luto at kernel.org
Tue May 23 14:11:59 PDT 2017
On Tue, May 23, 2017 at 1:19 PM, <Mario.Limonciello at dell.com> wrote:
>> > There are some configurations that have multiple NVMe disks.
>> > For example the Precision 7520 can have up to 3.
>> >
>> > NVME Identify Controller:
>> ...
>> > mn : A400 NVMe SanDisk 512GB
>> ...
>> > ps 0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
>> > rwt:0 rwl:0 idle_power:- active_power:5.30W
>> > ps 1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
>> > rwt:1 rwl:1 idle_power:- active_power:3.30W
>> > ps 2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
>> > rwt:2 rwl:2 idle_power:- active_power:3.30W
>> > ps 3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
>> > rwt:0 rwl:0 idle_power:- active_power:-
>> > ps 4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
>> > rwt:0 rwl:0 idle_power:- active_power:-
>> >
>>
>> 44.5mW saved (50mW at PS3 vs. 5.5mW at PS4) and totally crazy latency.
>>
>> >
>> > NVME Identify Controller:
>> ...
>> > mn : THNSF5512GPUK NVMe SED TOSHIBA 512GB
>> ...
>> > ps 0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>> > rwt:0 rwl:0 idle_power:- active_power:-
>> > ps 1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
>> > rwt:1 rwl:1 idle_power:- active_power:-
>> > ps 2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
>> > rwt:2 rwl:2 idle_power:- active_power:-
>> > ps 3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
>> > rwt:3 rwl:3 idle_power:- active_power:-
>> > ps 4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
>> > rwt:4 rwl:4 idle_power:- active_power:-
>>
>> 6mW saved (12mW at PS3 vs. 6mW at PS4) and still fairly crazy latency.
>> A 70ms exit latency means you drop a couple of frames.
>>
>> >
>> >
>> > NVME Identify Controller:
>> ...
>> > mn : CX2-GB1024-Q11 NVMe LITEON 1024GB
>> ...
>> > ps 0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>> > rwt:0 rwl:0 idle_power:- active_power:-
>> > ps 1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
>> > rwt:1 rwl:1 idle_power:- active_power:-
>> > ps 2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
>> > rwt:2 rwl:2 idle_power:- active_power:-
>> > ps 3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
>> > rwt:3 rwl:3 idle_power:- active_power:-
>> > ps 4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
>> > rwt:4 rwl:4 idle_power:- active_power:-
>>
>> 90mW saved (100mW at PS3 vs. 10mW at PS4) and still 100ms latency.
>> Also, I didn't know that Lite-On made disks.
>
> Well, I think the important one here is jumping down to PS3. That's a much bigger
> drop in power across all of these disks. The Liteon one will obviously go into PS3
> under the current patch, but the other two are just going to be vampires.
Ah, I missed that when reading the numbers.
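To make that concrete, using the mp figures above as a rough proxy for
idle draw (idle_power is unreported on all three):

  SanDisk: PS0 8.25W -> PS3 50mW  -> PS4 5.5mW
  Toshiba: PS0 6.00W -> PS3 12mW  -> PS4 6mW
  Liteon:  PS0 8.00W -> PS3 100mW -> PS4 10mW

Getting into PS3 at all saves watts; going from PS3 to PS4 only shaves
off another 6-90mW.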
>
>>
>> I'm not convinced that there's any chassis type for which this type of
>> default makes sense.
>>
> I guess I'm wondering where you came up with 25000 as the default:
> +static unsigned long default_ps_max_latency_us = 25000;
>
> Was it based across results of testing a bunch of disks, or from
> experimentation with a few higher end SSDs?
It was based on results across a bunch of disks, where "a bunch" == 2,
one that I own and one that Niranjan has. :) Also, 25ms is a nice
round number. I could be persuaded to increase it. (Although the
SanDisk one should hit PS3 as well, no?)
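(Working the numbers: SanDisk PS3 is 51000 + 10000 = 61000us and Toshiba
PS3 is 5000 + 25000 = 30000us, so under the (enlat + exlat) test neither
fits the 25000us default; only the Liteon PS3, at 5000 + 5000 = 10000us,
does. Under exlat alone, all three would fit.)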
I could also be persuaded to change the relevant parameter from (enlat
+ exlat) to something else. The spec says, in language that's about
as clear as mud, that starting to go non-operational and then doing
any actual work can take (enlat + exlat) time. But maybe real disks
aren't quite that bad. In any event, the common case should be just
exlat.
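If it helps the discussion, here's a minimal userspace sketch of the
admission test as I understand it (not the actual nvme_configure_apst
code; the struct and helper names are made up):

#include <stdbool.h>
#include <stdio.h>

/* One power state, using the fields from the id-ctrl listings above. */
struct ps {
	const char *name;
	bool nonoperational;
	unsigned long enlat_us;		/* entry latency */
	unsigned long exlat_us;		/* exit latency */
};

/*
 * The criterion being debated: a non-operational state is an
 * acceptable APST target iff its worst-case transition cost fits
 * the latency budget.  Drop enlat_us to model the exlat-only idea.
 */
static bool ps_allowed(const struct ps *p, unsigned long budget_us)
{
	return p->nonoperational &&
	       p->enlat_us + p->exlat_us <= budget_us;
}

int main(void)
{
	const struct ps ps3[] = {
		{ "SanDisk PS3", true, 51000, 10000 },
		{ "Toshiba PS3", true,  5000, 25000 },
		{ "Liteon PS3",  true,  5000,  5000 },
	};

	for (size_t i = 0; i < sizeof(ps3) / sizeof(ps3[0]); i++)
		printf("%s: %s at 25000us\n", ps3[i].name,
		       ps_allowed(&ps3[i], 25000) ? "allowed" : "rejected");
	return 0;
}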
Also, jeez, that Toshiba disk must *suck* under the RSTe policy. 25ms
exit latency incurred after 60ms of idle time? No thanks!
>
>> What would perhaps make sense is to have system-wide
>> performance-vs-power controls and to integrate NVMe power saving into
>> it, presumably through the pm_qos framework. Or to export more
>> information to userspace and have a user tool that sets all this up
>> generically.
>
> So I think you're already doing this: power/pm_qos_latency_tolerance_us
> and the module parameter default_ps_max_latency_us can effectively
> control it.
What I mean is: <device>/power could also expose some hints about
exactly what the tradeoffs are (to the best of the kernel's knowledge)
so that user code could make a more informed and more automatic
decision.
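And as a sketch of what such a user tool could do today with the
existing knob (the sysfs path here is my assumption of where the
attribute lands; adjust for the actual device):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Assumed path -- substitute the controller in question. */
	const char *knob =
		"/sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return EXIT_FAILURE;
	}
	/* A 100ms budget admits PS3 on all three disks above under
	 * the (enlat + exlat) test, but none of the PS4 states. */
	fprintf(f, "%d\n", 100000);
	return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
}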
>
> Kai Heng can comment more on the testing they've done and the performance
> impact, but I understand that by tweaking those knobs they've been able to
> get all these disks into at least PS3 and saved a lot of power.
>
> We could go work with the TLP project or the PowerTOP folks and have them
> tweak the various sysfs knobs to make more of these disks work,
> but I would rather the kernel had good defaults across this collection of disks.
Agreed.