[PATCH] nvme-pci: fix resume after AER recovery
Grochowski, Maciej
Maciej.Grochowski at sony.com
Thu Feb 2 10:47:35 PST 2023
> I've been trying to look for this code in latest upstream and it has changed entirely. Any chance you could do a quick run with Linux 6.1?
Hi Christoph,
I tried the same test on 6.1.9 (uname -r -> 6.1.9) with a Samsung PM9A3:
Samsung Electronics Co Ltd Device a80a
It gave me the same result as on the older kernel: the power state change failed and the device disappeared after the test (logs below).
```
[ 365.052300] pcieport 0000:00:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:0b:00.0
[ 365.062200] pcieport 0000:00:03.4: AER: Uncorrected (Fatal) error received: 0000:0b:00.0
[ 365.070439] nvme 0000:0b:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[ 365.081824] pcieport 0000:00:03.4: AER: broadcast error_detected message
[ 365.088635] nvme nvme5: frozen state error detected, reset controller
[ 365.157742] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[ 366.205193] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[ 366.213394] pcieport 0000:00:03.4: AER: Root Port link has been reset (0)
[ 366.220312] pcieport 0000:00:03.4: AER: broadcast slot_reset message
[ 366.226771] nvme nvme5: restart after slot reset
[ 366.232018] pcieport 0000:00:03.4: re-enabling LTR
[ 366.239088] nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)
...
[ 366.994113] nvme 0000:0b:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x1080200)
[ 367.003924] nvme 0000:0b:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100406)
[ 367.013663] nvme 0000:0b:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0xa80a144d)
[ 367.023595] pcieport 0000:00:03.4: AER: broadcast resume message
[ 367.045269] nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 367.061721] nvme nvme5: Removing after probe failure status: -19
[ 367.089330] nvme5n1: detected capacity change from 3750748848 to 0
[ 367.096431] pcieport 0000:00:03.4: AER: device recovery successful
[ 367.096470] nvme 0000:0b:00.0: vgaarb: pci_notify
[ 367.226016] pci 0000:0b:00.0: vgaarb: pci_notify
```
# ls /dev/nvme5n1
ls: cannot access '/dev/nvme5n1': No such file or directory
I ran the same test on another Samsung drive, the PM1733:
0b:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a801
```
[ 334.527200] pcieport 0000:00:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:0b:00.0
[ 334.537072] pcieport 0000:00:03.4: AER: Uncorrected (Fatal) error received: 0000:0b:00.0
[ 334.545320] nvme 0000:0b:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[ 334.556682] pcieport 0000:00:03.4: AER: broadcast error_detected message
[ 334.563467] nvme nvme5: frozen state error detected, reset controller
[ 334.615434] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[ 335.655445] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[ 335.663647] pcieport 0000:00:03.4: AER: Root Port link has been reset (0)
[ 335.670523] pcieport 0000:00:03.4: AER: broadcast slot_reset message
[ 335.676954] nvme nvme5: restart after slot reset
[ 335.684371] nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)
[ 336.427724] nvme 0000:0b:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x1080200)
[ 336.437510] nvme 0000:0b:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100406)
[ 336.447215] nvme 0000:0b:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0xa824144d)
[ 336.457117] pcieport 0000:00:03.4: AER: broadcast resume message
[ 336.479575] nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 336.494264] nvme nvme5: Removing after probe failure status: -19
[ 336.535861] pcieport 0000:00:03.4: AER: device recovery successful
[ 336.535899] nvme 0000:0b:00.0: vgaarb: pci_notify
[ 336.691465] pci 0000:0b:00.0: vgaarb: pci_notify
```
And again on a KIOXIA CD6 (Device 1e0f:0007).
This time recovery seems to work fine:
```
[ 760.688465] pcieport 0000:80:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:85:00.0
[ 760.700437] pcieport 0000:80:03.4: AER: Uncorrected (Fatal) error received: 0000:85:00.0
[ 760.710238] nvme 0000:85:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[ 760.723600] pcieport 0000:80:03.4: AER: broadcast error_detected message
[ 760.732011] nvme nvme3: frozen state error detected, reset controller
[ 760.793311] pcieport 0000:80:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[ 761.819692] pcieport 0000:80:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[ 761.829594] pcieport 0000:80:03.4: AER: Root Port link has been reset (0)
[ 761.838156] pcieport 0000:80:03.4: AER: broadcast slot_reset message
[ 761.846263] nvme nvme3: restart after slot reset
[ 761.852721] nvme 0000:85:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xbc200000)
[ 761.863559] nvme 0000:85:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xbc210004)
[ 761.874389] nvme 0000:85:00.0: restoring config space at offset 0xc (was 0x0, writing 0x10)
[ 761.884634] nvme 0000:85:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
[ 761.895725] pcieport 0000:80:03.4: AER: broadcast resume message
[ 761.921897] nvme 0000:85:00.0: saving config space at offset 0x0 (reading 0x71e0f)
...
[ 762.055409] nvme 0000:85:00.0: saving config space at offset 0x38 (reading 0x0)
[ 762.064599] nvme 0000:85:00.0: saving config space at offset 0x3c (reading 0x1ff)
[ 762.079458] nvme nvme3: Shutdown timeout set to 16 seconds
[ 762.166218] nvme nvme3: 64/0/0 default/read/poll queues
[ 762.176470] pcieport 0000:80:03.4: AER: device recovery successful
```
Device is still visible:
# ls /dev/nvme3n1
/dev/nvme3n1
So it looks like the behavior on 6.1.9 is very similar to what I see with the 5.15 kernel.
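For comparing runs like these across kernels and drives, here is a minimal sketch of a log check, assuming the dmesg output has been saved as text. The function name `classify_recovery` and the result keys are hypothetical; the message strings are copied from the logs above. Note that all three logs contain "AER: device recovery successful", including the two where the controller was removed, so the sketch looks at the nvme driver messages instead of relying on that line.

```python
import re

# Message fragments copied from the dmesg excerpts in this mail.
POWER_FAIL = "Unable to change power state from D3cold to D0"
PROBE_FAIL = re.compile(r"Removing after probe failure status: (-?\d+)")
AER_OK = "AER: device recovery successful"
QUEUES_UP = re.compile(r"\d+/\d+/\d+ default/read/poll queues")


def classify_recovery(dmesg_text):
    """Summarize how an AER recovery attempt ended (hypothetical helper)."""
    probe = PROBE_FAIL.search(dmesg_text)
    result = {
        # What the PCIe AER core reported.
        "aer_reported_success": AER_OK in dmesg_text,
        # What the nvme driver reported while restarting the controller.
        "power_state_failure": POWER_FAIL in dmesg_text,
        "probe_failure_status": int(probe.group(1)) if probe else None,
        "queues_restored": QUEUES_UP.search(dmesg_text) is not None,
    }
    # Assumption: the controller counts as recovered only if its queues
    # came back and no power-state or probe failure was logged.
    result["controller_survived"] = (
        result["queues_restored"]
        and not result["power_state_failure"]
        and result["probe_failure_status"] is None
    )
    return result
```

Running this over each saved log should flag the PM9A3 and PM1733 runs as failed and the CD6 run as recovered, which matches the `ls /dev/nvme*n1` checks above.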