[PATCH] nvme-pci: fix resume after AER recovery
Grochowski, Maciej
Maciej.Grochowski at sony.com
Wed Feb 1 14:58:40 PST 2023
Hi Keith!
I updated the kernel on my machine to 5.15.87 and ran the same experiment with the aer_inject tool.
As a result I got the following error and the NVMe device disappeared:
nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Full log:
```
[ 679.061060] pcieport 0000:00:03.2: aer_inject: Injecting errors 00000000/00004000 into device 0000:02:00.0
[ 679.061100] pcieport 0000:00:03.2: AER: Uncorrected (Fatal) error received: 0000:02:00.0
[ 679.061111] nvme 0000:02:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[ 679.061115] pcieport 0000:00:03.2: AER: broadcast error_detected message
[ 679.061120] nvme nvme15: frozen state error detected, reset controller
[ 679.076520] pcieport 0000:00:03.2: pciehp: pending interrupts 0x0010 from Slot Status
[ 679.076528] pcieport 0000:00:03.2: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[ 680.103638] pcieport 0000:00:03.2: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[ 680.103660] pcieport 0000:00:03.2: pciehp: pending interrupts 0x0010 from Slot Status
[ 680.103670] pcieport 0000:00:03.2: AER: Root Port link has been reset (0)
[ 680.103674] pcieport 0000:00:03.2: AER: broadcast slot_reset message
[ 680.103677] nvme nvme15: restart after slot reset
[ 680.106193] nvme 0000:02:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
...
[ 680.166532] pcieport 0000:00:03.2: AER: broadcast resume message
[ 680.171640] nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 680.171825] nvme nvme15: Removing after probe failure status: -19
[ 680.177638] nvme15n1: detected capacity change from 3750748848 to 0
```
After that, the device /dev/nvme15 disappears.
Are there any quirks that need to be applied for certain NVMe devices?
In my case I use a "Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO".
I tried the same experiment on a different AMD platform with a "Samsung Electronics Co Ltd Device a824" and it gave me the same result,
but when I ran this test on a KIOXIA NVMe device ("1e0f:0007") it worked fine and the device was able to recover.
-----Original Message-----
From: Keith Busch <kbusch at kernel.org>
Sent: Tuesday, January 31, 2023 7:22 AM
To: Christoph Hellwig <hch at lst.de>
Cc: sagi at grimberg.me; linux-nvme at lists.infradead.org; Grochowski, Maciej <Maciej.Grochowski at sony.com>
Subject: Re: [PATCH] nvme-pci: fix resume after AER recovery
On Tue, Jan 31, 2023 at 09:58:28AM +0100, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 11:43:28AM -0700, Keith Busch wrote:
> > > Why isn't slot_reset being called after error_detected? Driver
> > > should be returning "RESULT_NEEDS_RESET", which should have the
> > > pcie error handling always invoke the slot_reset callback.
> >
> > Are you using an older kernel that doesn't have
> >
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=387c72cdd7fb6bef650fb078d0f6ae9682abf631
>
> Oh, that does look like the real fix. That being said, what is the
> point of flushing the reset_work in nvme_error_resume?
It's so we don't try to handle a new error or remove before finishing recovery from the first one.
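
To illustrate the ordering discussed above, here is a minimal user-space sketch in C of the callback sequence (error_detected, slot_reset, resume). It is a toy model, not the nvme-pci code: the model_* names, the struct and the enums are hypothetical, and the flush in model_resume() only stands in for the kernel flushing the controller's reset work before recovery is reported as complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; they loosely mirror the PCI error-recovery
 * concepts from the thread but are NOT the kernel's definitions. */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_RECOVERED, ERS_DISCONNECT };
enum chan_state { CHAN_NORMAL, CHAN_FROZEN, CHAN_PERM_FAILURE };

struct model_ctrl {
    bool reset_pending; /* reset work queued but not yet finished */
    bool live;          /* controller usable for I/O */
    int  resets_done;   /* number of completed resets */
};

/* error_detected: a frozen channel asks the PCI core for a slot reset. */
static enum ers_result model_error_detected(struct model_ctrl *c, enum chan_state s)
{
    switch (s) {
    case CHAN_NORMAL:
        return ERS_CAN_RECOVER;
    case CHAN_FROZEN:
        c->live = false;
        return ERS_NEED_RESET;
    default:
        c->live = false;
        return ERS_DISCONNECT;
    }
}

/* slot_reset: only queues the controller reset; it does not wait for it. */
static enum ers_result model_slot_reset(struct model_ctrl *c)
{
    c->reset_pending = true;
    return ERS_RECOVERED;
}

/* The deferred reset work itself. */
static void model_reset_work(struct model_ctrl *c)
{
    if (!c->reset_pending)
        return;
    c->resets_done++;
    c->live = true;
    c->reset_pending = false;
}

/* resume: wait for the queued reset to finish (stand-in for a flush), so a
 * new error or a removal cannot race with the first recovery. */
static void model_resume(struct model_ctrl *c)
{
    model_reset_work(c);
    assert(!c->reset_pending);
}

/* Drive the callbacks in the order the PCI core would. Returns true if the
 * controller is usable afterwards. */
static bool model_recover(struct model_ctrl *c, enum chan_state s)
{
    enum ers_result r = model_error_detected(c, s);

    if (r == ERS_NEED_RESET)
        r = model_slot_reset(c);
    if (r == ERS_DISCONNECT)
        return false;
    model_resume(c);
    return c->live;
}
```

In this model, calling model_recover() with CHAN_FROZEN on a live controller takes the reset path and returns true after exactly one reset, while CHAN_PERM_FAILURE returns false without ever reaching the resume step.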