[bugzilla-daemon at bugzilla.kernel.org: [Bug 209149] New: "iommu/vt-d: Enable PCI ACS for platform opt in hint" makes NVMe config space not accessible after S3]
Bjorn Helgaas
helgaas at kernel.org
Wed Sep 23 12:03:27 EDT 2020
[+cc IOMMU and NVMe folks]

Sorry, I forgot to forward this to linux-pci when it was first
reported.

Apparently this happens with v5.9-rc3, and may be related to
50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint"),
which appeared in v5.8-rc3.

There are several dmesg logs and proposed patches in the bugzilla, but
no analysis yet of what the problem is. From the first dmesg
attachment (https://bugzilla.kernel.org/attachment.cgi?id=292327):
[ 50.434945] PM: suspend entry (deep)
[ 50.802086] nvme 0000:01:00.0: saving config space at offset 0x0 (reading 0x11e0f)
[ 50.842775] ACPI: Preparing to enter system sleep state S3
[ 50.858922] ACPI: Waking up from system sleep state S3
[ 50.883622] nvme 0000:01:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 50.947352] nvme 0000:01:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0x11e0f)
[ 50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
[ 50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
[ 50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 50.947830] pcieport 0000:00:1b.0: device [8086:06ac] error status/mask=00200000/00010000
[ 50.947831] pcieport 0000:00:1b.0: [21] ACSViol (First)
[ 50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
[ 50.947843] nvme nvme0: frozen state error detected, reset controller
I suspect the nvme "can't change power state" and restore config space
errors are a consequence of the DPC event. If DPC disables the link,
the device is inaccessible.

I don't know what caused the ACS Violation. The AER TLP Header Log
might have a clue, but unfortunately we didn't print it.
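
For reference, the "error status/mask=00200000/00010000" line in the log
above can be decoded by hand. The following is a minimal Python sketch,
assuming the short bit names the kernel prints for the AER Uncorrectable
Error Status register (only a subset of bits is listed here).
decode_uncor_status() is a hypothetical helper for illustration, not a
kernel function.

```python
# Subset of AER Uncorrectable Error Status bits, keyed by bit position.
# Names follow the short strings the kernel prints (e.g. "ACSViol").
UNCOR_BIT_NAMES = {
    12: "TLP",        # Poisoned TLP
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    18: "MalfTLP",    # Malformed TLP
    19: "ECRC",       # ECRC Error
    20: "UnsupReq",   # Unsupported Request
    21: "ACSViol",    # ACS Violation
}


def decode_uncor_status(status, mask):
    """Return (bit, name, is_masked) for every bit set in status."""
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = UNCOR_BIT_NAMES.get(bit, "bit%d" % bit)
            is_masked = bool(mask & (1 << bit))
            decoded.append((bit, name, is_masked))
    return decoded


# Values taken from the "error status/mask=00200000/00010000" log line.
print(decode_uncor_status(0x00200000, 0x00010000))
```

With the values from the log, the only status bit set is 21 (ACSViol),
and the only masked bit is 16 (UnxCmplt), so the ACS Violation is not
masked. That is consistent with DPC reporting an unmasked uncorrectable
error.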
Tangent:

The fact that we didn't print the AER TLP Header log looks like
a bug in itself. PCIe r5.0, sec 6.2.7, table 6-5, says many
errors, including ACS Violation, should log the TLP header. But
aer_get_device_error_info() only reads the log for error bits in
AER_LOG_TLP_MASKS, which doesn't include PCI_ERR_UNC_ACSV.

I don't think there's a "TLP Header Log Valid" bit, and it's ugly to
have to update AER_LOG_TLP_MASKS if new errors are added. I think
maybe we should always print the header log.
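
To make the AER_LOG_TLP_MASKS point concrete, here is a small Python
model of the check. It assumes the mask is defined as in
drivers/pci/pcie/aer.c around v5.9 and uses the PCI_ERR_UNC_* values
from include/uapi/linux/pci_regs.h; both should be verified against the
tree being debugged. header_log_is_read() is a hypothetical name for the
condition, not a kernel function.

```python
# PCI_ERR_UNC_* bit values (assumed to match pci_regs.h).
PCI_ERR_UNC_POISON_TLP = 0x00001000
PCI_ERR_UNC_COMP_ABORT = 0x00008000
PCI_ERR_UNC_UNX_COMP = 0x00010000
PCI_ERR_UNC_MALF_TLP = 0x00040000
PCI_ERR_UNC_ECRC = 0x00080000
PCI_ERR_UNC_UNSUP = 0x00100000
PCI_ERR_UNC_ACSV = 0x00200000

# Assumed v5.9-era definition of the mask; note that ACSV is absent.
AER_LOG_TLP_MASKS = (
    PCI_ERR_UNC_POISON_TLP
    | PCI_ERR_UNC_ECRC
    | PCI_ERR_UNC_UNSUP
    | PCI_ERR_UNC_COMP_ABORT
    | PCI_ERR_UNC_UNX_COMP
    | PCI_ERR_UNC_MALF_TLP
)


def header_log_is_read(uncor_status):
    """Model of the condition gating the TLP Header Log read."""
    return bool(uncor_status & AER_LOG_TLP_MASKS)


# Status from the report: only ACSViol (bit 21) is set.
print(header_log_is_read(0x00200000))
# A Malformed TLP, by contrast, is covered by the mask.
print(header_log_is_read(PCI_ERR_UNC_MALF_TLP))
```

Under these assumptions, the status from this report does not intersect
the mask, so the header log is skipped, while an error such as Malformed
TLP would have it read and printed.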
----- Forwarded message from bugzilla-daemon at bugzilla.kernel.org -----
Date: Fri, 04 Sep 2020 14:31:20 +0000
From: bugzilla-daemon at bugzilla.kernel.org
To: bjorn at helgaas.com
Subject: [Bug 209149] New: "iommu/vt-d: Enable PCI ACS for platform opt in
hint" makes NVMe config space not accessible after S3
Message-ID: <bug-209149-41252 at https.bugzilla.kernel.org/>
https://bugzilla.kernel.org/show_bug.cgi?id=209149
Bug ID: 209149
Summary: "iommu/vt-d: Enable PCI ACS for platform opt in hint"
makes NVMe config space not accessible after S3
Product: Drivers
Version: 2.5
Kernel Version: mainline
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: PCI
Assignee: drivers_pci at kernel-bugs.osdl.org
Reporter: kai.heng.feng at canonical.com
Regression: No
Here's the error:
[ 50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
[ 50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
[ 50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 50.947830] pcieport 0000:00:1b.0: device [8086:06ac] error status/mask=00200000/00010000
[ 50.947831] pcieport 0000:00:1b.0: [21] ACSViol (First)
[ 50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
[ 50.947843] nvme nvme0: frozen state error detected, reset controller
----- End forwarded message -----