[PATCH] nvme-pci: fix resume after AER recovery
Keith Busch
kbusch at kernel.org
Thu Feb 2 11:43:50 PST 2023
On Thu, Feb 02, 2023 at 06:47:35PM +0000, Grochowski, Maciej wrote:
> > I've been trying to look for this code in latest upstream and it has changed entirely. Any chance you could do a quick run with Linux 6.1?
>
> It gave me the result as on older kernel: failure in power state change and device disappeared after test (logs below)
> ```
> [ 365.052300] pcieport 0000:00:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:0b:00.0
> [ 365.062200] pcieport 0000:00:03.4: AER: Uncorrected (Fatal) error received: 0000:0b:00.0
> [ 365.070439] nvme 0000:0b:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
> [ 365.081824] pcieport 0000:00:03.4: AER: broadcast error_detected message
> [ 365.088635] nvme nvme5: frozen state error detected, reset controller
> [ 365.157742] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
> [ 366.205193] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
> [ 366.213394] pcieport 0000:00:03.4: AER: Root Port link has been reset (0)
> [ 366.220312] pcieport 0000:00:03.4: AER: broadcast slot_reset message
> [ 366.226771] nvme nvme5: restart after slot reset
> [ 366.232018] pcieport 0000:00:03.4: re-enabling LTR
> [ 366.239088] nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)
> ...
> [ 366.994113] nvme 0000:0b:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x1080200)
> [ 367.003924] nvme 0000:0b:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100406)
> [ 367.013663] nvme 0000:0b:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0xa80a144d)
> [ 367.023595] pcieport 0000:00:03.4: AER: broadcast resume message
> [ 367.045269] nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
> ```
We'll do a secondary bus reset right before restarting the controller. It
sounds like this particular device isn't recovering from it.
Does the device come back if you manually remove/rescan? Something like this:
# echo 1 > /sys/bus/pci/devices/0000:0b:00.0/remove
# echo 1 > /sys/bus/pci/rescan
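If it helps to repeat the check across test runs, the two commands above could be wrapped roughly as below. This is only a sketch: the function names (pci_dev_present, pci_remove_rescan), the SYSFS_PCI override and the one-second polling are my own assumptions, not anything the driver or PCI core provides. It needs to run as root on real hardware.

```shell
#!/bin/sh
# Sketch: remove a PCI function, rescan the bus, report whether it came back.
# SYSFS_PCI is an assumed override so the logic can be tried against a fake tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Succeeds if the device directory currently exists in sysfs.
pci_dev_present() {
    [ -d "$SYSFS_PCI/devices/$1" ]
}

# $1 = device address (e.g. 0000:0b:00.0), $2 = seconds to wait (default 5).
# Returns 0 if the device is present after the rescan, 1 if it is still
# missing, 2 if a sysfs write failed.
pci_remove_rescan() {
    bdf="$1"
    tries="${2:-5}"
    # Only write to "remove" if the device is still enumerated.
    if pci_dev_present "$bdf"; then
        echo 1 > "$SYSFS_PCI/devices/$bdf/remove" || return 2
    fi
    echo 1 > "$SYSFS_PCI/rescan" || return 2
    # Poll once per second; the interval is arbitrary.
    while [ "$tries" -gt 0 ]; do
        if pci_dev_present "$bdf"; then
            echo "$bdf: present after rescan"
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    echo "$bdf: still missing after rescan"
    return 1
}
```

For the device in the log this would be called as "pci_remove_rescan 0000:0b:00.0". A "still missing" result would suggest the device does not come back even with a full re-enumeration.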