[PATCH] nvme-pci: fix resume after AER recovery

Wed Feb 8 23:55:12 PST 2023

On 08.02.2023 22:38, Grochowski, Maciej wrote:
>> Are you observing DPC events during the error injection recovery? I didn't see any in the previous logs.
>
>Correct, no DPC was triggered during error recovery.
>I initially ran into an issue with the NVMe device not being able to recover when a DPC event was triggered, and while trying to narrow down what happened I found that the same effect can be caused by injecting a fatal AER error via aer-inject.
>
>> If you have DPC enabled, it really doesn't make sense to aer_inject on anything downstream capable ports. Those types of errors would be contained by the DPC hardware; the kernel will get a DPC event instead of an AER.
>
>Understood, these two scenarios should be investigated separately. I will remove DPC from the picture then.
>I disabled DPC on my setup and removed the kernel option. I ran the same experiment and got the same results; however, by playing with the connection to the NVMe drive I was able to reproduce the same result on KIOXIA.
>
>That led me to the conclusion that this error may depend not on the NVMe drive but on the connection.
>To prove that, I got a Samsung PM9A3 in M.2 form factor (my previous experiments were done on U.2 devices connected via SlimSAS or OCuLink). Interestingly, the M.2 PM9A3 is able to recover from such a scenario, proof below:
>
>```
>pcieport 0000:80:03.3: aer_inject: Injecting errors 00000000/00040000 into device 0000:84:00.0
>pcieport 0000:80:03.3: AER: Uncorrected (Fatal) error received: 0000:84:00.0
>nvme 0000:84:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
>pcieport 0000:80:03.3: AER: broadcast error_detected message
>nvme nvme0: frozen state error detected, reset controller
>pcieport 0000:80:03.3: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
>#d0_delay = 100ms
>pcieport 0000:80:03.3: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
>pcieport 0000:80:03.3: AER: Root Port link has been reset (0)
>pcieport 0000:80:03.3: AER: broadcast slot_reset message
>nvme nvme0: restart after slot reset
>nvme 0000:84:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xbc200000)
>nvme 0000:84:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xbc210004)
>nvme 0000:84:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
>pcieport 0000:80:03.3: AER: broadcast resume message
>nvme 0000:84:00.0: saving config space at offset 0x0 (reading 0xa80a144d)
>nvme 0000:84:00.0: saving config space at offset 0x4 (reading 0x100406)
>nvme 0000:84:00.0: saving config space at offset 0x8 (reading 0x1080200)
>nvme 0000:84:00.0: saving config space at offset 0xc (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x10 (reading 0xbc210004)
>nvme 0000:84:00.0: saving config space at offset 0x14 (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x18 (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x1c (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x20 (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x24 (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x28 (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x2c (reading 0xa812144d)
>nvme 0000:84:00.0: saving config space at offset 0x30 (reading 0xbc200000)
>nvme 0000:84:00.0: saving config space at offset 0x34 (reading 0x40)
>nvme 0000:84:00.0: saving config space at offset 0x38 (reading 0x0)
>nvme 0000:84:00.0: saving config space at offset 0x3c (reading 0x1ff)
>nvme nvme0: Shutdown timeout set to 8 seconds
>nvme nvme0: 64/0/0 default/read/poll queues
>pcieport 0000:80:03.3: AER: device recovery successful
>```
>
>So my current conclusion is that this is very likely a hardware-related issue, not a driver/software one.
>I will dig deeper into it, but for this particular thread I think there is not much more we can do.

Thanks for sharing this, Maciej.

Our firmware team is investigating the issue and has not been able to
reproduce it either. I will ask them to pause this until we have more data.

Seems we do not need a quirk then either.
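
For anyone repeating the experiment on other cabling, here is a minimal sketch for checking how far a captured kernel log got through the AER recovery sequence. It only matches the message strings that appear in the successful log quoted above; the function names and the stage labels are illustrative assumptions, not anything provided by the kernel or nvme-cli.

```python
import re

# Stage markers, in the order they appear in the successful recovery log
# quoted above. The short labels on the left are made up for this sketch.
STAGES = [
    ("error_detected", r"AER: broadcast error_detected message"),
    ("slot_reset", r"AER: broadcast slot_reset message"),
    ("resume", r"AER: broadcast resume message"),
    ("recovered", r"AER: device recovery successful"),
]


def recovery_stages(log_text):
    """Return the labels of the stages found in order.

    Scanning stops at the first stage whose message is missing, so the
    last element shows how far recovery progressed.
    """
    reached = []
    pos = 0
    for name, pattern in STAGES:
        match = re.search(pattern, log_text[pos:])
        if match is None:
            break
        reached.append(name)
        pos += match.end()  # only look for later stages after this one
    return reached


def recovery_completed(log_text):
    """True if every stage, including the final success line, was seen."""
    return recovery_stages(log_text) == [name for name, _ in STAGES]
```

Feeding it the text of a saved dmesg capture, for example `recovery_stages(open("dmesg.txt").read())` with a hypothetical file name, shows the last stage reached, which makes it easier to compare the U.2 and M.2 runs side by side.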