Fwd: Regression: Kernel 6.4 rc1 and higher causes Steam Deck to fail to wake from suspend (bisected)

Bjorn Helgaas helgaas at kernel.org
Sat Mar 30 06:46:55 PDT 2024


[+cc Keith, Sagi, Hannes, Kai-Heng, +bcc silverspring from bugzilla]

On Wed, Nov 01, 2023 at 06:45:41AM -0500, Bjorn Helgaas wrote:
> On Tue, Oct 31, 2023 at 03:21:20PM +0700, Bagas Sanjaya wrote:
> > I notice a regression report on Bugzilla [1]. Quoting from it:
> > 
> > > On Kernel 6.4 rc1 and higher if you put the Steam Deck into
> > > suspend then press the power button again it will not wake up. 
> > > 
> > > I don't have a clue as to -why- this commit breaks wake from
> > > suspend on steam deck, but it does. Bisected to:
> > > 
> > > ```
> > > 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit
> > > commit 1ad11eafc63ac16e667853bee4273879226d2d1b
> > > Author: Bjorn Helgaas <bhelgaas at google.com>
> > > Date:   Tue Mar 7 14:32:43 2023 -0600
> > > 
> > >     nvme-pci: drop redundant pci_enable_pcie_error_reporting()
> > >     
> > >     pci_enable_pcie_error_reporting() enables the device to send ERR_*
> > >     Messages.  Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
> > >     native"), the PCI core does this for all devices during enumeration, so the
> > >     driver doesn't need to do it itself.
> > >     
> > >     Remove the redundant pci_enable_pcie_error_reporting() call from the
> > >     driver.  Also remove the corresponding pci_disable_pcie_error_reporting()
> > >     from the driver .remove() path.
> > >     
> > >     Note that this only controls ERR_* Messages from the device.  An ERR_*
> > >     Message may cause the Root Port to generate an interrupt, depending on the
> > >     AER Root Error Command register managed by the AER service driver.
> > >     
> > >     Signed-off-by: Bjorn Helgaas <bhelgaas at google.com>
> > >     Reviewed-by: Chaitanya Kulkarni <kch at nvidia.com>
> > >     Signed-off-by: Christoph Hellwig <hch at lst.de>
> > > 
> > >  drivers/nvme/host/pci.c | 6 +-----
> > >  1 file changed, 1 insertion(+), 5 deletions(-)
> > > ```

> > > Reverting that commit by itself on top of 6.5.9 (stable) allows
> > > it to wake from suspend properly.

> > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=218090
> 
> Thanks, I requested some dmesg logs and lspci output to help debug
> this.

silverspring attached lspci output and a dmesg log from v6.8 to the
bugzilla and also noted that "pci=noaer" works around the problem.

The problem commit is 1ad11eafc63a ("nvme-pci: drop redundant
pci_enable_pcie_error_reporting()")
(https://git.kernel.org/linus/1ad11eafc63a)

1ad11eafc63a removed pci_disable_pcie_error_reporting() from the
nvme_suspend() path, so we now leave the PCIe Device Control error
enables set when we didn't before.  My theory is that the PCIe link
goes down during suspend, which causes an error interrupt, and the
interrupt causes a problem on Steam Deck.  Maybe there's some BIOS
connection.

"pci=noaer" would work around this because those error enables would
never be set in the first place.

I asked reporters to test the debug patches below to disable those
error interrupts during suspend.

I don't think this would be the *right* fix; if we need to do this, I
think it should be done by the PCI core, not by individual drivers.
Kai-Heng has been suggesting this for a while for a different
scenario.

Bjorn


commit 60c07557d0cc ("Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"")
Author: Bjorn Helgaas <bhelgaas at google.com>
Date:   Fri Mar 29 17:54:30 2024 -0500

    Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"
    
    This reverts commit 69b264df8a412820e98867dbab871c6526c5e5aa.


diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ac6293c24976..273f9c6f93dd 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -243,6 +243,18 @@ static int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 	return pcibios_err_to_errno(rc);
 }
 
+int pci_disable_pcie_error_reporting(struct pci_dev *dev)
+{
+	int rc;
+
+	if (!pcie_aer_is_native(dev))
+		return -EIO;
+
+	rc = pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
+	return pcibios_err_to_errno(rc);
+}
+EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
+
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
 	int aer = dev->aer_cap;
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..425e5e430e65 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -40,9 +40,14 @@ struct aer_capability_regs {
 int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log *log);
 
 #if defined(CONFIG_PCIEAER)
+int pci_disable_pcie_error_reporting(struct pci_dev *dev);
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
 #else
+static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev)
+{
+	return -EINVAL;
+}
 static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
 	return -EINVAL;
commit 8efb88cf23d4 ("nvme-pci: disable error reporting in nvme_dev_disable()")
Author: Bjorn Helgaas <bhelgaas at google.com>
Date:   Fri Mar 29 17:52:39 2024 -0500

    nvme-pci: disable error reporting in nvme_dev_disable()
    
    Debug patch.
    
    The PCI core enables error reporting in pci_aer_init() for all devices that
    advertise AER support.
    
    During suspend, nvme_suspend() calls nvme_dev_disable() in several cases.
    Prior to 1ad11eafc63a, nvme_dev_disable() disabled error reporting.
    
    After 1ad11eafc63a, error reporting will remain enabled during suspend.
    Maybe having error reporting enabled during suspend causes a problem on
    Steam Deck.
    
    "pci=noaer" prevents pci_aer_init() from enabling error reporting, so as
    long as the BIOS doesn't enable it, error reporting should always be
    disabled.
    
      nvme_suspend
        nvme_disable_prepare_reset
          nvme_dev_disable


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..2be838b5d1f6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -5,6 +5,7 @@
  */
 
 #include <linux/acpi.h>
+#include <linux/aer.h>
 #include <linux/async.h>
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
@@ -2603,8 +2604,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_suspend_io_queues(dev);
 	nvme_suspend_queue(dev, 0);
 	pci_free_irq_vectors(pdev);
-	if (pci_is_enabled(pdev))
+	if (pci_is_enabled(pdev)) {
+		pci_disable_pcie_error_reporting(pdev);
 		pci_disable_device(pdev);
+	}
 	nvme_reap_pending_cqes(dev);
 
 	nvme_cancel_tagset(&dev->ctrl);



More information about the Linux-nvme mailing list