[PATCH] nvme-pci: fix resume after AER recovery

Javier.gonz at samsung.com
Mon Feb 6 06:02:20 PST 2023


On 03.02.2023 18:45, Grochowski, Maciej wrote:
>> > I ran remove/rescan for this Samsung PM1733 and it looks like it works
>> > fine on both 5.15 and 6.1.9
>>
>> Sounds like the Samsung wants a longer, non-standard delay between SBR and reinit.
>
>Thanks for the suggestion.
>
>Hi Javier:
>
>We have 2 Samsung NVMe drives: PM9A3 and PM1733.
>When we inject a fatal AER error via aer_inject, these drives are not able to recover due to the
>"Unable to change power state from D3cold to D0, device inaccessible" error.
>
>Repeated log from the previous mail (the behavior is consistent on the 5.15 and 6.1 kernels):
>```
>[  334.527200] pcieport 0000:00:03.4: aer_inject: Injecting errors 00000000/00004000 into device 0000:0b:00.0
>[  334.537072] pcieport 0000:00:03.4: AER: Uncorrected (Fatal) error received: 0000:0b:00.0
>[  334.545320] nvme 0000:0b:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
>[  334.556682] pcieport 0000:00:03.4: AER: broadcast error_detected message
>[  334.563467] nvme nvme5: frozen state error detected, reset controller
>[  334.615434] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
>[  335.655445] pcieport 0000:00:03.4: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
>[  335.663647] pcieport 0000:00:03.4: AER: Root Port link has been reset (0)
>[  335.670523] pcieport 0000:00:03.4: AER: broadcast slot_reset message
>[  335.676954] nvme nvme5: restart after slot reset
>[  335.684371] nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)
>[  336.427724] nvme 0000:0b:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x1080200)
>[  336.437510] nvme 0000:0b:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100406)
>[  336.447215] nvme 0000:0b:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0xa824144d)
>[  336.457117] pcieport 0000:00:03.4: AER: broadcast resume message
>[  336.479575] nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
>[  336.494264] nvme nvme5: Removing after probe failure status: -19
>[  336.535861] pcieport 0000:00:03.4: AER: device recovery successful
>[  336.535899] nvme 0000:0b:00.0: vgaarb: pci_notify
>[  336.691465] pci 0000:0b:00.0: vgaarb: pci_notify
>```
>
>The same experiment with other NVMe vendors seems to work fine (I tried a KIOXIA NVMe drive).
>Is that something you can take a look at?

Thanks for the note, Maciej. I will report this internally.
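
For anyone triaging similar logs, here is a minimal sketch (not a fix, and not
anything from the driver) that scans dmesg text for "restoring config space"
lines where the old value read back as 0xffffffff, which usually means the
device did not answer config reads at that point. The function name
all_ones_restores and the regex are my own assumptions for illustration; the
line format is taken from the log quoted above.

```python
import re

# Matches kernel lines of the form seen in the quoted log, e.g.:
#   nvme 0000:0b:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x10a)
RESTORE_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"restoring config space at offset (?P<offset>0x[0-9a-f]+) "
    r"\(was (?P<was>0x[0-9a-f]+), writing (?P<new>0x[0-9a-f]+)\)"
)


def all_ones_restores(log_text):
    """Return {bdf: [offsets]} for restores whose old value was 0xffffffff.

    Hypothetical helper: it only inspects log text and touches no hardware.
    """
    hits = {}
    for line in log_text.splitlines():
        m = RESTORE_RE.search(line)
        if m and int(m.group("was"), 16) == 0xFFFFFFFF:
            hits.setdefault(m.group("bdf"), []).append(int(m.group("offset"), 16))
    return hits
```

Feeding it the output of dmesg (or a saved copy of the log above) gives a
per-device list of the config offsets that were all-ones during restore.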

Keith, Christoph,

Is there a chance we can get a quirk for this FW? It seems like an
issue on our side that is creating problems.

Thanks,
Javier


