[PATCH] nvme-pci: fix resume after AER recovery

Tue Feb 7 11:05:08 PST 2023

I also updated the kernel to 6.2-rc7 to make sure we are on the same revision.
I repeated the experiment and I am still getting the error on both drives, PM1733 and PM9A3,
so I think there may be some platform/FW differences.

I have two AMD Rome platforms that behave the same (RomeD8-2T and another custom-made board based on this design). I set these platforms to OS-first handling (instead of relying on platform FW).

Do you have a set of nvme-cli commands that I should issue for these drives so you can compare FW revisions and other details?

I also see that in your setup the NVMe is connected via the bridge 0000:00:01.2, so it looks very similar to my platform.

-----Original Message-----
From: Klaus Jensen <its at irrelevant.dk> 
Sent: Tuesday, February 7, 2023 2:37 AM
To: Javier.gonz at samsung.com
Cc: Grochowski, Maciej <Maciej.Grochowski at sony.com>; Keith Busch <kbusch at kernel.org>; Christoph Hellwig <hch at lst.de>; sagi at grimberg.me; linux-nvme at lists.infradead.org; Lewis, Nathaniel <Nathaniel.Lewis at sony.com>; Kanchan Joshi <joshi.k at samsung.com>; Klaus Jensen <k.jensen at samsung.com>
Subject: Re: [PATCH] nvme-pci: fix resume after AER recovery

On Feb  7 09:29, Javier.gonz at samsung.com wrote:
> On 07.02.2023 01:51, Grochowski, Maciej wrote:
> > I have tried the suggested approach, with a modification: the pci_device 
> > in pci_reset_secondary_bus is actually the bridge, not the NVMe device 
> > itself, so I checked the devices behind that bridge to see if any has the 
> > D0 bit and, based on that, ran the custom delay.
> > 
> > Unfortunately, even with this approach I see the same issue for both 
> > Samsung drives, and from the kernel logs I can see that the wait for the 
> > secondary bus reset did get increased. So it seems this quirk doesn't 
> > work for some reason. (I also tried increasing the delay to different 
> > values, but that didn't help.)
> 
> Too bad.
> 
> I will write you separately to get some dumps from the device. We have 
> not seen this before, so we need to understand this a bit better.
> 
> Regarding the quirk, we are looking into it. We will come back with 
> something in this thread later. Cc'ing Kanchan and Klaus.
> 

I dug up a PM1733 and I am not immediately able to reproduce on 6.2-rc7.

With an aer-inject error file,

  AER
  PCI_ID 0000:04:00.0
  UNCOR_STATUS MALF_TLP
  HEADER_LOG 0 1 2 3

I'm getting a Fatal error with type "Inaccessible, (Unregistered Agent ID)", but it still recovers successfully:

  pcieport 0000:00:01.2: aer_inject: Injecting errors 00000000/00040000 into device 0000:04:00.0
  pcieport 0000:00:01.2: AER: Uncorrected (Fatal) error received: 0000:04:00.0
  nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
  nvme nvme1: frozen state error detected, reset controller
  pcieport 0000:03:00.0: AER: Downstream Port link has been reset (0)
  nvme nvme1: restart after slot reset
  nvme nvme1: Shutdown timeout set to 10 seconds
  nvme nvme1: 32/0/0 default/read/poll queues
  pcieport 0000:03:00.0: AER: device recovery successful
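
For anyone comparing logs: as far as I can tell, the "00000000/00040000" in the first line is the injected correctable/uncorrectable status pair, and 0x00040000 is the Malformed TLP bit. A minimal sketch for decoding that pair from a log line follows; the bit values are the PCI_ERR_UNC_* definitions from the kernel's pci_regs.h, while the helper names (parse_inject_line, decode_uncor_status) are made up for this example.

```python
import re

# Uncorrectable error status bits; values as defined by PCI_ERR_UNC_* in
# include/uapi/linux/pci_regs.h. The short names are just labels here.
UNCOR_BITS = {
    0x00000010: "DLP",
    0x00000020: "SURPDN",
    0x00001000: "POISON_TLP",
    0x00002000: "FCP",
    0x00004000: "COMP_TIME",
    0x00008000: "COMP_ABORT",
    0x00010000: "UNX_COMP",
    0x00020000: "RX_OVER",
    0x00040000: "MALF_TLP",
    0x00080000: "ECRC",
    0x00100000: "UNSUP",
}

INJECT_RE = re.compile(r"Injecting errors ([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def parse_inject_line(line):
    """Return (cor_status, uncor_status) from an aer_inject log line, or None."""
    match = INJECT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def decode_uncor_status(mask):
    """List the names of the uncorrectable status bits set in mask."""
    return [name for bit, name in sorted(UNCOR_BITS.items()) if mask & bit]


if __name__ == "__main__":
    line = ("pcieport 0000:00:01.2: aer_inject: Injecting errors "
            "00000000/00040000 into device 0000:04:00.0")
    cor, uncor = parse_inject_line(line)
    print(hex(cor), decode_uncor_status(uncor))
```

This only decodes what was injected; it says nothing about why recovery succeeds or fails afterwards.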

Maciej, can you share the firmware revision information and a bit more detail on your reproducer/setup that might allow us to replicate it?
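
One possible way to gather the basics, as a rough sketch rather than a required procedure: the nvme driver exposes model, firmware_rev, serial and address attributes per controller under /sys/class/nvme, so a small script can dump them for every drive. The function name collect_nvme_info and the choice of fields are assumptions made for this example; the firmware revision should also be visible as the "fr" field of `nvme id-ctrl`.

```python
from pathlib import Path

# Per-controller sysfs attributes to read. This selection is an assumption
# for the example; attributes that are missing are reported as None.
FIELDS = ("model", "firmware_rev", "serial", "address")


def collect_nvme_info(sysfs_root="/sys/class/nvme"):
    """Return {controller_name: {field: value_or_None}} for each controller."""
    info = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return info
    for ctrl in sorted(root.iterdir()):
        entry = {}
        for field in FIELDS:
            try:
                # sysfs values are often space-padded and newline-terminated.
                entry[field] = (ctrl / field).read_text().strip()
            except OSError:
                entry[field] = None
        info[ctrl.name] = entry
    return info


if __name__ == "__main__":
    for name, entry in collect_nvme_info().items():
        print(name, entry)
```

The address field is the PCI address of the controller, which makes it easy to match each drive against the pcieport/nvme lines in the kernel log.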

Thanks,
Klaus