[PATCH] nvme-pci: fix resume after AER recovery
Grochowski, Maciej
Maciej.Grochowski at sony.com
Tue Feb 7 11:05:08 PST 2023
I also updated the kernel to 6.2-rc7 to make sure we are on the same revision.
I repeated the experiment and I am still getting the error on both drives, PM1733 and PM9A3,
so I think there may be some platform/FW differences.
I have two AMD Rome platforms that behave the same (RomeD8-2T and another custom-made board based on this design). I set these platforms to use OS-first handling (instead of relying on platform FW).
Do you have a set of nvme-cli commands that I should run on these drives so you can compare the FW revision and other details?
I also see that in your setup the NVMe device is connected via the bridge 0000:00:01.2, so it looks very similar to my platform.
-----Original Message-----
From: Klaus Jensen <its at irrelevant.dk>
Sent: Tuesday, February 7, 2023 2:37 AM
To: Javier.gonz at samsung.com
Cc: Grochowski, Maciej <Maciej.Grochowski at sony.com>; Keith Busch <kbusch at kernel.org>; Christoph Hellwig <hch at lst.de>; sagi at grimberg.me; linux-nvme at lists.infradead.org; Lewis, Nathaniel <Nathaniel.Lewis at sony.com>; Kanchan Joshi <joshi.k at samsung.com>; Klaus Jensen <k.jensen at samsung.com>
Subject: Re: [PATCH] nvme-pci: fix resume after AER recovery
On Feb 7 09:29, Javier.gonz at samsung.com wrote:
> On 07.02.2023 01:51, Grochowski, Maciej wrote:
> > I have tried the suggested approach, with some modification: the
> > pci_device in pci_reset_secondary_bus is actually the bridge, not the
> > NVMe device itself, so I checked the devices behind that bridge to see
> > if any has the D0 bit and, based on that logic, I run the custom delay.
> >
> > Unfortunately, even with this approach I see the same issue for both
> > Samsung drives, and based on the kernel logs I can see that the wait
> > for the secondary bus reset gets increased. So it seems this quirk
> > doesn't work for some reason. (I also tried increasing the delays to
> > different values, but it didn't work.)
>
> Too bad.
>
> I will write you separately to get some dumps from the device. We have
> not seen this before, so we need to understand this a bit better.
>
> Regarding the quirk, we are looking into it. We will come back with
> something in this thread later. Cc'ing Kanchan and Klaus.
>
I dug up a PM1733 and I am not immediately able to reproduce on 6.2-rc7.
With an aer-inject error file,
AER
PCI_ID 0000:04:00.0
UNCOR_STATUS MALF_TLP
HEADER_LOG 0 1 2 3
I'm getting a Fatal error with type "Inaccessible, (Unregistered Agent ID)", but it still recovers successfully:
pcieport 0000:00:01.2: aer_inject: Injecting errors 00000000/00040000 into device 0000:04:00.0
pcieport 0000:00:01.2: AER: Uncorrected (Fatal) error received: 0000:04:00.0
nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
nvme nvme1: frozen state error detected, reset controller
pcieport 0000:03:00.0: AER: Downstream Port link has been reset (0)
nvme nvme1: restart after slot reset
nvme nvme1: Shutdown timeout set to 10 seconds
nvme nvme1: 32/0/0 default/read/poll queues
pcieport 0000:03:00.0: AER: device recovery successful
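To compare runs across platforms, the log above can be classified mechanically. This is only a rough sketch, assuming the two message fragments quoted above ("error received: <BDF>" and "AER: device recovery successful") are the markers of interest; the function name recovery_outcome and the returned strings are made up for illustration.

```python
def recovery_outcome(log_lines, bdf):
    """Classify an AER event for the device at PCI address `bdf`.

    Assumption: an error is announced with "error received: <bdf>" and a
    successful recovery with "AER: device recovery successful", as in the
    log quoted in this thread. Other kernels may word these differently.
    """
    error_seen = False
    for line in log_lines:
        if f"error received: {bdf}" in line:
            error_seen = True
        # The success message is printed by the port above the device,
        # so it is not matched against `bdf` itself.
        elif error_seen and "AER: device recovery successful" in line:
            return "recovered"
    return "not recovered" if error_seen else "no error seen"
```

Feeding it the captured kernel log split into lines, together with 0000:04:00.0, gives a one-word summary that is easy to diff between a working and a failing setup.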
Maciej, can you share firmware revision information and a bit more details on your reproducer/setup that might allow us to replicate?
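One possible way to collect the basic identity fields is to read them from sysfs. This is a minimal sketch, assuming the controller exposes the model, serial and firmware_rev attributes under /sys/class/nvme/<ctrl>; the helper name controller_identity and the sysfs_root parameter are hypothetical and only there so the path can be overridden.

```python
from pathlib import Path


def controller_identity(ctrl, sysfs_root="/sys/class/nvme"):
    """Return model, serial and firmware revision for a controller.

    Assumption: the attributes live in <sysfs_root>/<ctrl>/ as plain text
    files. A missing attribute is reported as None instead of raising.
    """
    base = Path(sysfs_root) / ctrl
    info = {}
    for attr in ("model", "serial", "firmware_rev"):
        path = base / attr
        info[attr] = path.read_text().strip() if path.is_file() else None
    return info


if __name__ == "__main__":
    # "nvme1" matches the controller name in the log above; adjust as needed.
    print(controller_identity("nvme1"))
```

The same fields are also reported by nvme-cli, so this is just a fallback for systems where the tool is not installed.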
Thanks,
Klaus