[PATCH] nvme-pci: fix resume after AER recovery

Grochowski, Maciej Maciej.Grochowski at sony.com
Wed Feb 1 14:58:40 PST 2023


Hi Keith!

I updated the kernel on my machine to the 5.15.87 sources and ran the same experiment with the aer_inject tool.
As a result I got the following error and the NVMe device disappeared:

nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)

Full log:
```
[  679.061060] pcieport 0000:00:03.2: aer_inject: Injecting errors 00000000/00004000 into device 0000:02:00.0
[  679.061100] pcieport 0000:00:03.2: AER: Uncorrected (Fatal) error received: 0000:02:00.0
[  679.061111] nvme 0000:02:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[  679.061115] pcieport 0000:00:03.2: AER: broadcast error_detected message
[  679.061120] nvme nvme15: frozen state error detected, reset controller
[  679.076520] pcieport 0000:00:03.2: pciehp: pending interrupts 0x0010 from Slot Status
[  679.076528] pcieport 0000:00:03.2: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[  680.103638] pcieport 0000:00:03.2: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[  680.103660] pcieport 0000:00:03.2: pciehp: pending interrupts 0x0010 from Slot Status
[  680.103670] pcieport 0000:00:03.2: AER: Root Port link has been reset (0)
[  680.103674] pcieport 0000:00:03.2: AER: broadcast slot_reset message
[  680.103677] nvme nvme15: restart after slot reset
[  680.106193] nvme 0000:02:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
...
[  680.166532] pcieport 0000:00:03.2: AER: broadcast resume message
[  680.171640] nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[  680.171825] nvme nvme15: Removing after probe failure status: -19
[  680.177638] nvme15n1: detected capacity change from 3750748848 to 0
```
After that, the device /dev/nvme15 disappears.
Are there any quirks that need to be applied for certain NVMe devices?
In my case I use a "Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO".

I tried the same experiment on a different AMD platform with a "Samsung Electronics Co Ltd Device a824" and it gave me the same result,
but when I ran this test on a KIOXIA NVMe device "1e0f:0007" it worked fine and the device was able to recover.
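
For comparing runs across devices (for example the Samsung and KIOXIA cases above), the recovery stages can be pulled out of a dmesg capture mechanically. The following is only a minimal sketch: the `summarize` helper and the two regular expressions are illustrative assumptions written against the message formats visible in the log above, not part of any kernel tooling.

```python
import re

# Excerpt of the log above; in practice this would be read from dmesg output.
LOG = """\
[  679.061115] pcieport 0000:00:03.2: AER: broadcast error_detected message
[  680.103674] pcieport 0000:00:03.2: AER: broadcast slot_reset message
[  680.166532] pcieport 0000:00:03.2: AER: broadcast resume message
[  680.171640] nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
"""

# Assumed patterns, derived only from the lines shown in this thread.
STAGE = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*AER: broadcast (\w+) message")
FAIL = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*config space inaccessible")


def summarize(log):
    """Return ([(stage, timestamp), ...], [failure timestamps])."""
    stages, failures = [], []
    for line in log.splitlines():
        m = STAGE.match(line)
        if m:
            stages.append((m.group(2), float(m.group(1))))
            continue
        m = FAIL.match(line)
        if m:
            failures.append(float(m.group(1)))
    return stages, failures


stages, failures = summarize(LOG)
for name, ts in stages:
    print(f"{ts:.6f}  {name}")
for ts in failures:
    print(f"{ts:.6f}  config space inaccessible")
```

On the excerpt above this lists error_detected, slot_reset and resume in that order, with the "config space inaccessible" message timestamped after the resume broadcast.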


-----Original Message-----
From: Keith Busch <kbusch at kernel.org> 
Sent: Tuesday, January 31, 2023 7:22 AM
To: Christoph Hellwig <hch at lst.de>
Cc: sagi at grimberg.me; linux-nvme at lists.infradead.org; Grochowski, Maciej <Maciej.Grochowski at sony.com>
Subject: Re: [PATCH] nvme-pci: fix resume after AER recovery

On Tue, Jan 31, 2023 at 09:58:28AM +0100, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 11:43:28AM -0700, Keith Busch wrote:
> > > Why isn't slot_reset being called after error_detected? Driver 
> > > should be returning "RESULT_NEEDS_RESET", which should have the 
> > > pcie error handling always invoke the slot_reset callback.
> > 
> > Are you using an older kernel that doesn't have
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=387c72cdd7fb6bef650fb078d0f6ae9682abf631
> 
> Oh, that does look like the real fix.  That being said, what is the
> point of flushing the reset_work in nvme_error_resume?

It's so we don't try to handle a new error or remove before finishing recovery from the first one.
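
The effect of that flush can be illustrated with a toy userspace model. This is only a sketch under loud assumptions: `ToyController`, `queue_reset` and `flush_reset` are hypothetical stand-ins for a controller whose reset runs as an asynchronous work item, and a Python thread plays the role of the work queue. It is not the driver's code.

```python
import threading
import time


class ToyController:
    """Hypothetical stand-in for a controller with asynchronous reset work."""

    def __init__(self):
        self.state = "live"
        self._reset_work = None

    def queue_reset(self):
        # Analogous to scheduling the reset work: returns immediately.
        self.state = "resetting"
        self._reset_work = threading.Thread(target=self._do_reset)
        self._reset_work.start()

    def _do_reset(self):
        time.sleep(0.05)  # pretend re-initialisation takes a while
        self.state = "live"

    def flush_reset(self):
        # Analogous to flushing the work item: block until it has finished.
        if self._reset_work is not None:
            self._reset_work.join()


def slot_reset(ctrl):
    # The slot_reset stage only kicks off the reset.
    ctrl.queue_reset()


def resume(ctrl, flush=True):
    # With flush=True, recovery is complete before this stage returns,
    # so a later error or removal never sees a half-finished reset.
    if flush:
        ctrl.flush_reset()
    return ctrl.state


ctrl = ToyController()
slot_reset(ctrl)
state_after_resume = resume(ctrl)
print(state_after_resume)
```

With the flush in place the state observed after `resume` is always "live"; calling `resume(ctrl, flush=False)` instead can return while the toy reset is still in flight, which is the window the flush is meant to close.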


