SSD low power state during system suspend

Kevin Rowland kevin.p.rowland at gmail.com
Mon Oct 24 16:40:56 PDT 2022


Hey all,

We're having a discussion in the PCI list about a panic during suspend
when PCIe ASPM is enabled on a link with an SSD at the other end [1].
I'm rooting around in the NVMe drivers now, so I wanted to ask some
specific questions here.

The motivation for enabling ASPM is so that we can put the downstream
SSD in its lowest power state (PS4 in our case), by getting past the
`pcie_aspm_enabled()` check in `nvme_suspend()`, which allows us to
call `nvme_set_power_state(ctrl, ctrl->npss)` [2].

If we don't enable ASPM, then the SSD stays in PS0, the PCIe layer
puts the endpoint controller in D3_HOT, and the SSD consumes more
power than we would like when the system is suspended. In fact, the
vendor says that the SSD should consume much less power than what
we've measured.
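
To make the two paths above concrete, here is a small user-space model
of the decision. It is only a sketch of how the condition in the v5.10
`nvme_suspend()` reads to me, not the driver code itself: the struct,
the enum, and `choose_suspend_path()` are hypothetical names, and every
field other than the ASPM one is an assumption about which other inputs
that condition consults.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state consulted at suspend time. */
struct suspend_inputs {
	bool via_firmware;    /* models pm_suspend_via_firmware() (assumed input) */
	unsigned int npss;    /* models ctrl->npss, the deepest power state index */
	bool aspm_enabled;    /* models pcie_aspm_enabled(pdev) */
	bool hmb_in_use;      /* host memory buffer allocated (assumed input) */
	bool simple_suspend;  /* a "simple suspend" quirk is set (assumed input) */
};

enum suspend_path {
	PATH_FULL_SHUTDOWN,    /* nvme_disable_prepare_reset(); PCI PM picks D3 */
	PATH_NVME_POWER_STATE, /* nvme_set_power_state(ctrl, ctrl->npss) */
};

/* Any one disqualifying input sends the device down the shutdown path. */
enum suspend_path choose_suspend_path(const struct suspend_inputs *in)
{
	if (in->via_firmware || in->npss == 0 || !in->aspm_enabled ||
	    in->hmb_in_use || in->simple_suspend)
		return PATH_FULL_SHUTDOWN;

	return PATH_NVME_POWER_STATE;
}
```

With `npss = 4` and ASPM enabled the model returns
`PATH_NVME_POWER_STATE`; clearing `aspm_enabled` alone flips it to
`PATH_FULL_SHUTDOWN`, which is the case where the SSD is left in PS0.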

After enabling ASPM, the panic during suspend happens when attempting
to mask interrupts while the PCIe controller is unpowered and
unclocked, during `suspend_disable_secondary_cpus()`. One way around
this is to simply keep the PCIe controller powered and clocked until
later in the suspend process. The QCOM folks on this thread [3] had
some success doing this by postponing some clock operations to
`syscore_suspend()`, but I wasn't able to get a similar scheme working
on our NXP SoC.

Anyway, now I'm exploring other options to put the SSD in PS4 and
avoid the panic, without applying messy patches to the platform power
management logic. One thing I decided to try is to call
`nvme_set_power_state(ctrl, ctrl->npss)` during suspend, and then to
unconditionally call both `pci_load_saved_state(pdev, NULL)` and
`nvme_disable_prepare_reset()` after that; the latter calls
`nvme_pci_disable()`, which in turn disables and frees MSIs. I did
this by simply commenting out the `if()` condition here [4].

The way I see it, this puts the SSD in PS4, frees PCIe MSIs, and
allows the PCIe PM subsystem to put the endpoint in D3_HOT. On our
hardware we can't remove power from the SSD, so it stays in D3_HOT.

Experimentally this kind of halfway works - interrupts have been freed
during `nvme_disable_prepare_reset()` so we no longer attempt to mask
them during `suspend_disable_secondary_cpus()`, and I no longer see
any panic while suspending! I am waiting for current measurements to
see if the SSD is actually in PS4 and drawing less power.

Unfortunately, on resume, things go terribly wrong and the SSD doesn't
respond to requests for Identify Controller data:
```
[  139.130155] nvme 0001:01:00.0: nvme_resume()
[  199.484042] nvme nvme0: I/O 5 QID 0 timeout, disable controller
[  199.664294] nvme nvme0: Identify Controller failed (-4)
[  199.664309] nvme nvme0: Removing after probe failure status: -5
```
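
The gap between the first two log lines is worth computing. The helper
below is a minimal sketch for doing that from raw dmesg lines; both
function names are made up for illustration. For the lines above the
gap is roughly 60 seconds, which would be consistent with an admin
command (QID 0) running into the driver's admin timeout, assuming the
`admin_timeout` module parameter is at its usual default of 60 seconds.

```c
#include <stdio.h>

/*
 * Parse the leading "[ seconds.micros]" stamp of a dmesg line.
 * Returns 0 on success, -1 if the line does not start with a stamp.
 */
int parse_dmesg_stamp(const char *line, double *seconds)
{
	/* %lf skips the padding spaces that dmesg puts after '['. */
	return sscanf(line, " [%lf", seconds) == 1 ? 0 : -1;
}

/* Elapsed time in seconds between two already-parsed stamps. */
double elapsed_seconds(double earlier, double later)
{
	return later - earlier;
}
```

Feeding it the `nvme_resume()` line and the `I/O 5 QID 0 timeout` line
gives 199.484042 - 139.130155, i.e. a little over 60.35 seconds.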

So, am I crazy to be pursuing this? Can I _theoretically_ put the SSD
in a lower power state, then call `nvme_disable_prepare_reset()` and
expect it to stay in the lower power state while the system suspends?
Should I keep investigating the Identify Controller timeout during
resume, or is this approach bound to fail?

Thanks,
Kevin

[1] https://lore.kernel.org/linux-pci/CAHK3GzwYmk6Fr6-YCjEmCweCPBdWJwHDz4Vc8CdX0xfT7b=-xQ@mail.gmail.com/
[2] https://elixir.bootlin.com/linux/v5.10.72/source/drivers/nvme/host/pci.c#L3096
[3] https://lore.kernel.org/linux-pci/1663669347-29308-2-git-send-email-quic_krichai@quicinc.com/
[4] https://elixir.bootlin.com/linux/v5.10.72/source/drivers/nvme/host/pci.c#L3100
