[PATCH] nvme-pci: fix resume after AER recovery

Grochowski, Maciej Maciej.Grochowski at sony.com
Thu Feb 2 10:47:35 PST 2023


> I've been trying to look for this code in latest upstream and it has changed entirely.  Any chance you could do a quick run with Linux 6.1?

Hi Christoph,
I ran the same test on 6.1.9 (uname -r -> 6.1.9) with a Samsung PM9A3:

Samsung Electronics Co Ltd Device a80a

It gave the same result as on the older kernel: the power state change failed and the device disappeared after the test (logs below).
```
[  365.052300] pcieport 0000:00:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:0b:00.0
[  365.062200] pcieport 0000:00:03.4: AER: Uncorrected (Fatal) error received: 0000:0b:00.0
[  365.070439] nvme 0000:0b:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[  365.081824] pcieport 0000:00:03.4: AER: broadcast error_detected message
[  365.088635] nvme nvme5: frozen state error detected, reset controller
[  365.157742] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[  366.205193] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[  366.213394] pcieport 0000:00:03.4: AER: Root Port link has been reset (0)
[  366.220312] pcieport 0000:00:03.4: AER: broadcast slot_reset message
[  366.226771] nvme nvme5: restart after slot reset
[  366.232018] pcieport 0000:00:03.4: re-enabling LTR
[  366.239088] nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)  
...
[  366.994113] nvme 0000:0b:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x1080200)
[  367.003924] nvme 0000:0b:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100406)
[  367.013663] nvme 0000:0b:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0xa80a144d)
[  367.023595] pcieport 0000:00:03.4: AER: broadcast resume message
[  367.045269] nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  367.061721] nvme nvme5: Removing after probe failure status: -19
[  367.089330] nvme5n1: detected capacity change from 3750748848 to 0
[  367.096431] pcieport 0000:00:03.4: AER: device recovery successful
[  367.096470] nvme 0000:0b:00.0: vgaarb: pci_notify
[  367.226016] pci 0000:0b:00.0: vgaarb: pci_notify
```
# ls /dev/nvme5n1
ls: cannot access '/dev/nvme5n1': No such file or directory

I ran the same test on another Samsung drive, a PM1733:
0b:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801

```
[  334.527200] pcieport 0000:00:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:0b:00.0
[  334.537072] pcieport 0000:00:03.4: AER: Uncorrected (Fatal) error received: 0000:0b:00.0
[  334.545320] nvme 0000:0b:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[  334.556682] pcieport 0000:00:03.4: AER: broadcast error_detected message
[  334.563467] nvme nvme5: frozen state error detected, reset controller
[  334.615434] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[  335.655445] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[  335.663647] pcieport 0000:00:03.4: AER: Root Port link has been reset (0)
[  335.670523] pcieport 0000:00:03.4: AER: broadcast slot_reset message
[  335.676954] nvme nvme5: restart after slot reset
[  335.684371] nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)
[  336.427724] nvme 0000:0b:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x1080200)
[  336.437510] nvme 0000:0b:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100406)
[  336.447215] nvme 0000:0b:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0xa824144d)
[  336.457117] pcieport 0000:00:03.4: AER: broadcast resume message
[  336.479575] nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  336.494264] nvme nvme5: Removing after probe failure status: -19
[  336.535861] pcieport 0000:00:03.4: AER: device recovery successful
[  336.535899] nvme 0000:0b:00.0: vgaarb: pci_notify
[  336.691465] pci 0000:0b:00.0: vgaarb: pci_notify
```

And again with a KIOXIA CD6 (Device 1e0f:0007).
This time recovery seems to work fine:
```
[  760.688465] pcieport 0000:80:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:85:00.0
[  760.700437] pcieport 0000:80:03.4: AER: Uncorrected (Fatal) error received: 0000:85:00.0
[  760.710238] nvme 0000:85:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[  760.723600] pcieport 0000:80:03.4: AER: broadcast error_detected message
[  760.732011] nvme nvme3: frozen state error detected, reset controller
[  760.793311] pcieport 0000:80:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[  761.819692] pcieport 0000:80:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[  761.829594] pcieport 0000:80:03.4: AER: Root Port link has been reset (0)
[  761.838156] pcieport 0000:80:03.4: AER: broadcast slot_reset message
[  761.846263] nvme nvme3: restart after slot reset
[  761.852721] nvme 0000:85:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xbc200000)
[  761.863559] nvme 0000:85:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xbc210004)
[  761.874389] nvme 0000:85:00.0: restoring config space at offset 0xc (was 0x0, writing 0x10)
[  761.884634] nvme 0000:85:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
[  761.895725] pcieport 0000:80:03.4: AER: broadcast resume message
[  761.921897] nvme 0000:85:00.0: saving config space at offset 0x0 (reading 0x71e0f)
...
[  762.055409] nvme 0000:85:00.0: saving config space at offset 0x38 (reading 0x0)
[  762.064599] nvme 0000:85:00.0: saving config space at offset 0x3c (reading 0x1ff)
[  762.079458] nvme nvme3: Shutdown timeout set to 16 seconds
[  762.166218] nvme nvme3: 64/0/0 default/read/poll queues
[  762.176470] pcieport 0000:80:03.4: AER: device recovery successful
```
Device is still visible:

# ls /dev/nvme3n1
/dev/nvme3n1

So it looks like the behavior on 6.1.9 is very similar to what I see with the 5.15 kernel.
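
For anyone repeating this across more drives, here is a minimal sketch (not part of the test above) of how the outcome could be read off a saved dmesg capture. The function name `classify_recovery` and both patterns are assumptions. The patterns are derived only from the messages quoted in the logs above, so other kernels or drivers may word them differently.

```python
import re

# Assumed patterns, copied from the log lines quoted in this mail.
# "Removing after probe failure" marks a controller that was dropped;
# the "default/read/poll queues" line marks one that came back up.
REMOVED = re.compile(r"nvme (nvme\d+): Removing after probe failure status: (-?\d+)")
RESTARTED = re.compile(r"nvme (nvme\d+): \d+/\d+/\d+ default/read/poll queues")


def classify_recovery(dmesg_text):
    """Map each controller seen in the log to 'removed' or 'recovered'."""
    result = {}
    for line in dmesg_text.splitlines():
        match = REMOVED.search(line)
        if match:
            result[match.group(1)] = "removed"
            continue
        match = RESTARTED.search(line)
        # A controller already marked as removed is not flipped back.
        if match and result.get(match.group(1)) != "removed":
            result[match.group(1)] = "recovered"
    return result
```

The "AER: device recovery successful" line is deliberately not used as a signal, since it is printed in both the failing and the working case above.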



