[PATCH] nvme-pci: fix resume after AER recovery

Keith Busch kbusch at kernel.org
Thu Feb 2 11:43:50 PST 2023


On Thu, Feb 02, 2023 at 06:47:35PM +0000, Grochowski, Maciej wrote:
> > I've been trying to look for this code in latest upstream and it has changed entirely.  Any chance you could do a quick run with Linux 6.1?
> 
> It gave me the same result as on the older kernel: failure in power state change and device disappeared after test (logs below)
> ```
> [  365.052300] pcieport 0000:00:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:0b:00.0
> [  365.062200] pcieport 0000:00:03.4: AER: Uncorrected (Fatal) error received: 0000:0b:00.0
> [  365.070439] nvme 0000:0b:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
> [  365.081824] pcieport 0000:00:03.4: AER: broadcast error_detected message
> [  365.088635] nvme nvme5: frozen state error detected, reset controller
> [  365.157742] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
> [  366.205193] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
> [  366.213394] pcieport 0000:00:03.4: AER: Root Port link has been reset (0)
> [  366.220312] pcieport 0000:00:03.4: AER: broadcast slot_reset message
> [  366.226771] nvme nvme5: restart after slot reset
> [  366.232018] pcieport 0000:00:03.4: re-enabling LTR
> [  366.239088] nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)  
> ...
> [  366.994113] nvme 0000:0b:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x1080200)
> [  367.003924] nvme 0000:0b:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100406)
> [  367.013663] nvme 0000:0b:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0xa80a144d)
> [  367.023595] pcieport 0000:00:03.4: AER: broadcast resume message
> [  367.045269] nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
> ```

We'll do a secondary bus reset right before restarting the controller. It
sounds like this particular device isn't recovering from it.

Does the device come back if you manually remove/rescan? Something like this:

  # echo 1 > /sys/bus/pci/devices/0000:0b:00.0/remove
  # echo 1 > /sys/bus/pci/rescan
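
If it helps to repeat the check across test runs, the two writes above can be wrapped in a small script. This is only a sketch: the function name pci_remove_rescan, the SYSFS_PCI override and the SETTLE_SECS delay are made up for illustration, and the two-second default is an arbitrary guess, not a value the kernel requires. It only reports whether the device directory is back in sysfs after the rescan, which says nothing about whether the nvme driver bound to it again.

```shell
#!/bin/sh
# Sketch: remove a PCI function, rescan the bus, and report whether the
# device directory reappeared. Run as root on a real system.
# SYSFS_PCI is overridable (an assumption of this sketch) so the logic
# can be dry-run against a fake directory tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

pci_remove_rescan() {
    bdf="$1"
    dev="$SYSFS_PCI/devices/$bdf"

    # The device may already be gone after a failed recovery; only
    # write to "remove" if the attribute is still there.
    if [ -e "$dev/remove" ]; then
        echo 1 > "$dev/remove" || return 1
    fi

    # Ask the PCI core to re-enumerate all buses.
    echo 1 > "$SYSFS_PCI/rescan" || return 1

    # Arbitrary settle time before checking (hypothetical default).
    sleep "${SETTLE_SECS:-2}"

    if [ -d "$dev" ]; then
        echo "$bdf: present after rescan"
        return 0
    fi
    echo "$bdf: still missing after rescan"
    return 1
}
```

For the device in the log this would be called as `pci_remove_rescan 0000:0b:00.0`.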


