[PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
Bjorn Helgaas
helgaas at kernel.org
Fri Nov 22 14:20:50 PST 2024
[+cc Rafael]
On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> PCI core allows users to configure the D3Cold state for each PCI device
> through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> to allow/disallow the PCI devices to enter D3Cold during system suspend.
>
> So make use of this flag in the NVMe driver to shutdown the NVMe device
> during system suspend if the user has allowed D3Cold for the device.
> Existing checks in the NVMe driver decide whether to shut down the device
> (based on platform/device limitations), so use this flag as the last resort
> to keep the existing behavior.
>
> The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> D3Cold and the users can disallow it through sysfs if they want.
What problem does this solve? I guess there must be a case where
suspend leaves NVMe in a higher power state than you want?
What does it mean that this is the "last resort to keep the existing
behavior"? All the checks are OR'd together and none have side
effects, so the order doesn't really matter. It changes the existing
behavior *unless* the user has explicitly cleared d3cold_allowed via
sysfs.
pdev->d3cold_allowed is set by default, so I guess this change means
that unless the user clears d3cold_allowed, we let the PCI core decide
the suspend state instead of managing it directly in nvme_suspend()?
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam at linaro.org>
> ---
> drivers/nvme/host/pci.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 4b9fda0b1d9a..a4d4687854bf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> */
> if (pm_suspend_via_firmware() || !ctrl->npss ||
> !pcie_aspm_enabled(pdev) ||
> - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> + pdev->d3cold_allowed)
> return nvme_disable_prepare_reset(ndev, true);
I guess your intent is that if users prevent use of D3cold via sysfs,
we'll use the NVMe-specific power states, and otherwise, the PCI core
will decide?
I think returning 0 here means the PCI core decides what state to use
in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This
could be any state from D0 to D3cold depending on platform support and
wakeup considerations (see pci_target_state()).
I'm not sure the use of pdev->d3cold_allowed here really expresses
your underlying intent. It suggests that you're really hoping for
D3cold, but that's only a possibility if firmware supports it, and we
have no visibility into that here.
It also seems contrary to the earlier comment that suggests we prefer
host managed nvme power settings:
* The platform does not remove power for a kernel managed suspend so
* use host managed nvme power settings for lowest idle power if
* possible. This should have quicker resume latency than a full device
* shutdown. But if the firmware is involved after the suspend or the
* device does not support any non-default power states, shut down the
* device fully.
Bjorn
More information about the Linux-nvme
mailing list