NVMe APST high latency power states being skipped

Andy Lutomirski luto at kernel.org
Tue May 23 14:11:59 PDT 2017


On Tue, May 23, 2017 at 1:19 PM,  <Mario.Limonciello at dell.com> wrote:
>> > There are some configurations that have multiple NVMe disks.
>> > For example the Precision 7520 can have up to 3.
>> >
>> > NVME Identify Controller:
>> ...
>> > mn      : A400 NVMe SanDisk 512GB
>> ...
>> > ps    0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:5.30W
>> > ps    1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
>> >           rwt:1 rwl:1 idle_power:- active_power:3.30W
>> > ps    2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
>> >           rwt:2 rwl:2 idle_power:- active_power:3.30W
>> > ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> > ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> >
>>
>> 44.5mW saved and totally crazy latency.
>>
>> >
>> > NVME Identify Controller:
>> ...
>> > mn      : THNSF5512GPUK NVMe SED TOSHIBA 512GB
>> ...
>> > ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> > ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
>> >           rwt:1 rwl:1 idle_power:- active_power:-
>> > ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
>> >           rwt:2 rwl:2 idle_power:- active_power:-
>> > ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
>> >           rwt:3 rwl:3 idle_power:- active_power:-
>> > ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
>> >           rwt:4 rwl:4 idle_power:- active_power:-
>>
>> 6 mW saved and still fairly crazy latency.  70ms means you drop a couple frames.
>>
>> >
>> >
>> > NVME Identify Controller:
>> ...
>> > mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
>> ...
>> > ps    0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> > ps    1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
>> >           rwt:1 rwl:1 idle_power:- active_power:-
>> > ps    2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
>> >           rwt:2 rwl:2 idle_power:- active_power:-
>> > ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
>> >           rwt:3 rwl:3 idle_power:- active_power:-
>> > ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
>> >           rwt:4 rwl:4 idle_power:- active_power:-
>>
>> 90mW saved and still 100ms latency.  Also, I didn't know that Lite-on
>> made disks.
>
> Well, the important one here, I think, is jumping down to PS3.  That's a
> much bigger drop in power across all of these disks.  The Liteon one will
> obviously go into PS3 under the current patch, but the other two are just
> going to be vampires.

Ah, I missed that when reading the numbers.

>
>>
>> I'm not convinced that there's any chassis type for which this type of
>> default makes sense.
>>
> I guess I'm wondering where you came up with 25000 as the default:
> +static unsigned long default_ps_max_latency_us = 25000;
>
> Was it based across results of testing a bunch of disks, or from
> experimentation with a few higher end SSDs?

It was based on results across a bunch of disks, where "a bunch" == 2,
one that I own and one that Niranjan has. :)  Also, 25ms is a nice
round number.  I could be persuaded to increase it.  (Although the
SanDisk one should hit PS3 as well, no?)

I could also be persuaded to change the relevant parameter from (enlat
+ exlat) to something else.  The spec says, in language that's about
as clear as mud, that starting to go non-operational and then doing
any actual work can take (enlat + exlat) time.  But maybe real disks
aren't quite that bad.  In any event, the common case should be just
exlat.
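
For concreteness, here's a minimal sketch of that cutoff rule as described
above -- the names and shape are illustrative, not the actual driver code:

/*
 * Treat (enlat + exlat) as the worst-case cost of parking in a
 * non-operational state, and only enable states whose cost fits
 * within the latency budget.
 */
struct ps_desc {
        unsigned int enlat_us;  /* worst-case entry latency, in us */
        unsigned int exlat_us;  /* worst-case exit latency, in us */
};

static int ps_allowed(const struct ps_desc *ps, unsigned long max_latency_us)
{
        return (unsigned long)ps->enlat_us + ps->exlat_us <= max_latency_us;
}

/*
 * Applied to PS3 on the disks quoted above, against the 25000us default:
 *   SanDisk A400:    51000 + 10000 = 61000us  -> skipped
 *   Toshiba THNSF5:   5000 + 25000 = 30000us  -> skipped
 *   LiteOn CX2:       5000 +  5000 = 10000us  -> allowed
 */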

Also, jeez, that Toshiba disk must *suck* under the RSTe policy.  25ms
exit latency incurred after 60ms of idle time?  No thanks!

>
>> What would perhaps make sense is to have system-wide
>> performance-vs-power controls and to integrate NVMe power saving into
>> it, presumably through the pm_qos framework.  Or to export more
>> information to userspace and have a user tool that sets all this up
>> generically.
>
> So I think you're already doing this.  power/pm_qos_latency_tolerance_us
> and the module parameter default_ps_max_latency_us can effectively
> change it.

What I mean is: <device>/power could also expose some hints about
exactly what the tradeoffs are (to the best of the kernel's knowledge)
so that user code could make a more informed and more automatic
decision.
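
(To illustrate the knob Mario mentions: userspace can already set a
per-device budget through the dev_pm_qos latency-tolerance attribute.
A rough sketch -- the sysfs path here is my assumption and depends on
where the device exposes the attribute:

#include <stdio.h>

int main(void)
{
        /* Hypothetical path for the first NVMe controller. */
        const char *path =
                "/sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return 1;
        }
        /* Budget of 25ms; writing "auto" hands control back to the kernel. */
        fprintf(f, "%d\n", 25000);
        return fclose(f) == 0 ? 0 : 1;
}

That's the sort of thing a userspace policy tool could drive from the
extra hints I'm suggesting.)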

>
> Kai Heng can comment more on the testing they've done and the performance
> impact, but I understand that by tweaking those knobs they've been able to
> get all these disks into at least PS3 and save a lot of power.
>
> We could go work with the TLP project or the PowerTOP folks and have them
> tweak the various sysfs knobs to make more of these disks work, but I
> would rather the kernel had good defaults across this collection of disks.

Agreed.


