[PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user

Manivannan Sadhasivam manivannan.sadhasivam at linaro.org
Sat Nov 23 01:01:13 PST 2024


+ Ulf (also interested in this topic)

On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> [+cc Rafael]
> 
> On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > PCI core allows users to configure the D3Cold state for each PCI device
> > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> > to allow/disallow the PCI devices to enter D3Cold during system suspend.
> >
> > So make use of this flag in the NVMe driver to shutdown the NVMe device
> > during system suspend if the user has allowed D3Cold for the device.
> > Existing checks in the NVMe driver decide whether to shut down the device
> > (based on platform/device limitations), so use this flag as the last resort
> > to keep the existing behavior.
> > 
> > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> > D3Cold and the users can disallow it through sysfs if they want.
> 
> What problem does this solve?  I guess there must be a case where
> suspend leaves NVMe in a higher power state than you want?
> 

Yeah, this is the case for all systems that doesn't fit into the existing checks
in the NVMe suspend path:

1. ACPI based platforms
2. Controller doesn't support NPSS (hardware issue/limitation)
3. ASPM not enabled
4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag

In my case, all the Qualcomm SoCs using Devicetree doesn't fall into the above
checks. Hence, they were not fully powered down during system suspend and always
in low power state. This means, I cannot achieve 'CX power collapse', a Qualcomm
specific SoC powered down state that consumes just enough power to wake up the
SoC. Since the controller driver keeps the PCI resource vote because of NVMe,
the firmware in the Qualcomm SoCs cannot put the SoC into above mentioned low
power state.

> What does it mean that this is the "last resort to keep the existing
> behavior"?  All the checks are OR'd together and none have side
> effects, so the order doesn't really matter.  It changes the existing
> behavior *unless* the user has explicitly cleared d3cold_allowed via
> sysfs.
> 

Since the checks are ORed, this new check is not going to change the existing
behavior for systems satisfying above checks i.e., even if the user changes the
flag to forbid D3Cold, it won't affect them and it *shoudn't*.

> pdev->d3cold_allowed is set by default, so I guess this change means
> that unless the user clears d3cold_allowed, we let the PCI core decide
> the suspend state instead of managing it directly in nvme_suspend()?
> 

If 'pdev->d3cold_allowed' is set, then NVMe driver will shutdown the device and
the PCI controller driver can turn off all bus specific resources. Otherwise,
NVMe driver will put the device into low power mode and the controller driver
has to do something similar to retain the device power.

> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam at linaro.org>
> > ---
> >  drivers/nvme/host/pci.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 4b9fda0b1d9a..a4d4687854bf 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> >  	 */
> >  	if (pm_suspend_via_firmware() || !ctrl->npss ||
> >  	    !pcie_aspm_enabled(pdev) ||
> > -	    (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> > +	    (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> > +	    pdev->d3cold_allowed)
> >  		return nvme_disable_prepare_reset(ndev, true);
> 
> I guess your intent is that if users prevent use of D3cold via sysfs,
> we'll use the NVMe-specific power states, and otherwise, the PCI core
> will decide?
> 

Not PCI core, but the controller drivers actually which takes care of powering
down the PCI bus resources.

> I think returning 0 here means the PCI core decides what state to use
> in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path.  This
> could be any state from D0 to D3cold depending on platform support and
> wakeup considerations (see pci_target_state()).
> 
> I'm not sure the use of pdev->d3cold_allowed here really expresses
> your underlying intent.  It suggests that you're really hoping for
> D3cold, but that's only a possibility if firmware supports it, and we
> have no visibility into that here.
> 

I'm not relying on firmware to do anything here. If firmware has to decide the
suspend state, it should already satisfy the pm_suspend_via_firmware() check in
nvme_suspend(). Here, the controller driver takes care of putting the device
into D3Cold. Currently, the controller drivers cannot do it (on DT platforms)
because of NVMe driver's behavior.

> It also seems contrary to the earlier comment that suggests we prefer
> host managed nvme power settings:
> 
>   * The platform does not remove power for a kernel managed suspend so
>   * use host managed nvme power settings for lowest idle power if
>   * possible. This should have quicker resume latency than a full device
>   * shutdown.  But if the firmware is involved after the suspend or the
>   * device does not support any non-default power states, shut down the
>   * device fully.

This above comment satisfies the ACPI platforms as the firmware controls the
suspend state. On DT platforms, even though the firmware takes care of suspend
state, it still relies on the controller driver to relinquish the votes for PCI
resources. Only then, the firmware will put the whole SoC in power down mode
a.k.a CX power collapse mode in Qcom SoCs.

We did attempt so solve this problem in multiple ways, but the lesson learned
was, kernel cannot decide the power mode without help from userspace. That's the
reason I wanted to make use of this 'd3cold_allowed' sysfs attribute to allow
userspace to override the D3Cold if it wants based on platform requirement.

This is similar to how UFS allows users to configure power states of both the
device and controller:

/sys/bus/platform/drivers/ufshcd/*/spm_lvl
/sys/bus/platform/devices/*.ufs/spm_lvl

If the 'd3cold_allowed' attribute is not a good fit for this usecase, then I'd
like to introduce a new attribute similar to UFS.

- Mani

-- 
மணிவண்ணன் சதாசிவம்



More information about the Linux-nvme mailing list