[Bug 112121] New: Some PCIe options cause devices to be removed after suspend

Mon Mar 21 09:36:37 PDT 2016

Hi Mike,

I'm sorry this slipped through the cracks.  I apologize for the
inability of Google Inbox to send plaintext email; I use mutt
because that's a hassle for me, too.

On Sat, Feb 13, 2016 at 11:39:52PM +0000, Mike Lothian wrote:
> On 8 February 2016 at 13:51, Bjorn Helgaas <bhelgaas at google.com> wrote:
> > [+cc linux-pci, NVMe folks, power management folks]
> >
> > On Sun, Feb 7, 2016 at 11:04 AM,  <bugzilla-daemon at bugzilla.kernel.org> wrote:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=112121
> >>
> >>             Bug ID: 112121
> >>            Summary: Some PCIe options cause devices to be removed after
> > >>                     suspend
> >>            Product: Drivers
> >>            Version: 2.5
> >>     Kernel Version: 4.5-rc2
> >>           Hardware: All
> >>                 OS: Linux
> >>               Tree: Mainline
> >>             Status: NEW
> >>           Severity: normal
> >>           Priority: P1
> >>          Component: PCI
> >>           Assignee: drivers_pci at kernel-bugs.osdl.org
> >>           Reporter: mike at fireburn.co.uk
> >>         Regression: No
> >>
> >> Created attachment 203091
> >>   --> https://bugzilla.kernel.org/attachment.cgi?id=203091&action=edit
> >> Dmesg showing PCIe device removals
> >>
> >> I was having issues with suspend, when the machine was being resumed iommu
> >> started removing devices - including my PCIe NVMe drive which contained my root
> >> partition
> >>
> >> The problem showed up with:
> >>
> >> [*] PCI support
> >> [*]   Support mmconfig PCI config space access
> >> [*]   PCI Express Port Bus support
> >> [*]     PCI Express Hotplug driver
> >> [*]     Root Port Advanced Error Reporting support
> >> [*]       PCI Express ECRC settings control
> >> < >       PCIe AER error injector support
> >> -*-     PCI Express ASPM control
> >> [ ]       Debug PCI Express ASPM
> >>           Default ASPM policy (BIOS default)  --->
> >> [*]   Message Signaled Interrupts (MSI and MSI-X)
> >> [ ]   PCI Debugging
> >> [*]   Enable PCI resource re-allocation detection
> >> < >   PCI Stub driver
> >> [*]   Interrupts on hypertransport devices
> >> [ ] PCI IOV support
> >> [*] PCI PRI support
> >> -*- PCI PASID support
> >>     PCI host controller drivers  ----
> >> < > PCCard (PCMCIA/CardBus) support  ----
> >> [*] Support for PCI Hotplug  --->
> >> < > RapidIO support
> >>
> >>
> >> This is what I have now:
> >>
> >> [*] PCI support
> >> [*]   Support mmconfig PCI config space access
> >> [*]   PCI Express Port Bus support
> >> [ ]     Root Port Advanced Error Reporting support
> >> -*-     PCI Express ASPM control
> >> [ ]       Debug PCI Express ASPM
> >>           Default ASPM policy (BIOS default)  --->
> >> [*]   Message Signaled Interrupts (MSI and MSI-X)
> >> [*]   PCI Debugging
> >> [ ]   Enable PCI resource re-allocation detection
> >> < >   PCI Stub driver
> >> [*]   Interrupts on hypertransport devices
> >> [ ] PCI IOV support
> >> [ ] PCI PRI support
> >> [ ] PCI PASID support
> >>     PCI host controller drivers  ----
> >> < > PCCard (PCMCIA/CardBus) support  ----
> >> [ ] Support for PCI Hotplug  ----
> >> < > RapidIO support
> >>
> >> I tried disabling the iommu driver first but it had no effect
> >>
> >> If people are interested I could play with the above options to see which one
> >> causes the issue
> >
> > My guess is that PCI hotplug is the important one.  It would be nice
> > if dmesg contained enough information to connect nvme0n1 to a PCI
> > device.  It'd be even nicer if the PCI core noted device removals or
> > whatever happened here.
> >
> > You don't get any more details if you boot with "ignore_loglevel", do you?
> >
> > Mike, you didn't mark this as a regression, so I assume it's always
> > been this way, and we just haven't noticed it because most people
> > enable PCI hotplug (or whatever the relevant config option is).
> 
> I've just tested this again, I enabled PCI Hotplug & PCIe Hotplug and
> nothing - then I noticed I hadn't enabled the ACPI Hotplug driver -
> once I did the issue re-appeared
> 
> I then had to use testdisk to restore my partition table :'(
> 
> I've attached the updated dmesg & my .config

Correct me if I'm wrong:

  - With CONFIG_HOTPLUG_PCI_ACPI not set, suspend/resume works fine
  - With CONFIG_HOTPLUG_PCI_ACPI=y, resume fails as shown in your dmesg log
    (https://bugzilla.kernel.org/attachment.cgi?id=203621)
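
One way to confirm that the working and failing builds really differ
only in that option is to diff the two .config files.  The following is
a minimal sketch, not something taken from the bug report: the two
config fragments are hypothetical stand-ins for the real .config files,
and the helper names parse_config and diff_configs are made up for
illustration.  It relies only on the usual .config conventions
("CONFIG_FOO=y" and "# CONFIG_FOO is not set").

```python
import re

# Standard .config line shapes: "CONFIG_FOO=value" and "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_ symbol to its value; unset symbols are recorded as 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old, new):
    """Return {symbol: (old_value, new_value)} for every symbol that changed.

    A symbol missing from one side is treated as 'n'.
    """
    changed = {}
    for symbol in sorted(set(old) | set(new)):
        before = old.get(symbol, "n")
        after = new.get(symbol, "n")
        if before != after:
            changed[symbol] = (before, after)
    return changed


# Hypothetical fragments standing in for the two real .config files.
WORKING = """\
CONFIG_PCI=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_PCIE=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
"""

FAILING = """\
CONFIG_PCI=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_HOTPLUG_PCI_ACPI=y
"""

if __name__ == "__main__":
    changes = diff_configs(parse_config(WORKING), parse_config(FAILING))
    for symbol, (before, after) in changes.items():
        print(f"{symbol}: {before} -> {after}")
```

If anything other than CONFIG_HOTPLUG_PCI_ACPI shows up in the diff of
the real files, the summary above would need adjusting.  The kernel
tree's scripts/diffconfig does the same job on real .config files.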