[PATCH 1/1] nvme: extend and modify the APST configuration algorithm

Alexey Bogoslavsky Alexey.Bogoslavsky at wdc.com
Wed Apr 28 16:45:06 BST 2021


On Wed, Apr 28, 2021 at 5:43 AM hch at lst.de <hch at lst.de> wrote:
>
> Adding Andy who wrote the original APST code.
>
> On Wed, Apr 28, 2021 at 09:27:36AM +0000, Alexey Bogoslavsky wrote:
> > From: Alexey Bogoslavsky <Alexey.Bogoslavsky at wdc.com>
> >
> > The algorithm that was used until now for building the APST configuration
> > table has been found to produce entries with excessively long ITPT
> > (idle time prior to transition) for devices declaring relatively long
> > entry and exit latencies for non-operational power states. This leads
> > to unnecessary waste of power and, as a result, failure to pass
> > mandatory power consumption tests on Chromebook platforms.
> >
> > The new algorithm is based on two predefined ITPT values and two
> > predefined latency tolerances. Based on these values, as well as on
> > exit and entry latencies reported by the device, the algorithm looks
> > for up to 2 suitable non-operational power states to use as primary
> > and secondary APST transition targets. The predefined values are
> > supplied to the nvme driver as module parameters:
> >
> >  - apst_primary_timeout_ms (default: 100)
> >  - apst_secondary_timeout_ms (default: 2000)
> >  - apst_primary_latency_tol_us (default: 15000)
> >  - apst_secondary_latency_tol_us (default: 100000)
> >
> > The algorithm echoes the approach used by Intel's and Microsoft's drivers
> > on Windows. The specific default parameter values are also based on those
> > drivers. Yet, this patch doesn't introduce the ability to dynamically
> > regenerate the APST table in the event of switching the power source from
> > AC to battery and back. Adding this functionality may be considered in the
> > future. In the meantime, the timeouts and tolerances reflect a compromise
> > between values used by Microsoft for AC and battery scenarios.
> >
> > For most NVMe devices the new algorithm results in a more aggressive
> > power saving policy. While beneficial in most cases, this sometimes
> > comes at the price of higher IO processing latency in certain
> > scenarios, as well as a potential impact on the drive's endurance
> > (due to more frequent context saving when entering deep
> > non-operational states). To provide a fallback for systems where
> > these regressions cannot be tolerated, the patch allows reverting to
> > the legacy behavior by setting either apst_primary_timeout_ms or
> > apst_primary_latency_tol_us to 0. Eventually (possibly after
> > fine-tuning the default values of the module parameters) the legacy
> > behavior can be removed.
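
The two-target selection described in the patch can be sketched roughly as
follows. This is a minimal illustration, not the kernel code itself: the
parameter names and default values come from the description above, while
the deepest-first traversal and the use of entry + exit latency as the
"total latency" are assumptions about the details.

```python
# Defaults of the proposed module parameters (from the patch description).
APST_PRIMARY_TIMEOUT_MS = 100
APST_SECONDARY_TIMEOUT_MS = 2000
APST_PRIMARY_LATENCY_TOL_US = 15000
APST_SECONDARY_LATENCY_TOL_US = 100000

def pick_apst_targets(nonop_states):
    """nonop_states: list of (name, entry_lat_us, exit_lat_us) tuples,
    ordered from shallowest to deepest non-operational power state.
    Returns {state_name: itpt_ms} for up to two chosen transition targets."""
    targets = {}
    have_primary = have_secondary = False
    # Walk from the deepest state upward, since deeper states save more power.
    for name, entry_us, exit_us in reversed(nonop_states):
        total_us = entry_us + exit_us
        if not have_primary and total_us <= APST_PRIMARY_LATENCY_TOL_US:
            # Deepest state cheap enough for the short (primary) timeout.
            targets[name] = APST_PRIMARY_TIMEOUT_MS
            have_primary = True
        elif not have_secondary and total_us <= APST_SECONDARY_LATENCY_TOL_US:
            # Deeper but slower state: reach it only after a long idle period.
            targets[name] = APST_SECONDARY_TIMEOUT_MS
            have_secondary = True
    return targets

# Using the WD SN530 figures quoted later in this thread:
print(pick_apst_targets([("PS3", 3900, 11000), ("PS4", 5000, 39000)]))
# {'PS4': 2000, 'PS3': 100}
```

With these inputs the deep PS4 state (44 ms total latency) becomes the
secondary target at a 2000 ms idle timeout, while PS3 (14.9 ms, just inside
the 15 ms primary tolerance) becomes the primary target at 100 ms.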

>  Can you give an example of the APST states and latencies on a device
>  for which this is useful?

Sure. Originally, we faced this problem with a WD SN530 device that failed to pass
Google's power consumption tests. The device reports the following latencies (in us):
PS3: entry: 3900, exit: 11000 (translates to 745ms ITPT)
PS4: entry: 5000, exit: 39000 (translates to 2200ms ITPT)

Then we started looking at other devices and found more with a similar problem,
e.g. the Crucial P5 (latencies in us):
PS3: entry: 10000, exit: 2500 (translates to 625ms ITPT)
PS4: entry: 12000, exit: 35000 (translates to 2350ms ITPT)

>  I'm not opposed to adjusting the algorithm, but I'd like to understand
>  what we're up against.  If Linux were the only game in town, I would
>  say that the approach in this patch is unfortunate because of the
>  arbitrary thresholds it introduces, but if it tracks Windows, then
>  we're probably okay.
>--Andy

I agree. Using arbitrary numbers would be a very bad idea. But the numbers I'm
suggesting are indeed based on the schemes used on Windows, so they have proved
viable on a huge number of devices.

Regards,
Alexey
