nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery

Bjorn Helgaas helgaas at kernel.org
Fri Mar 10 15:53:06 PST 2023


[+cc Lukas, beginning of thread:
https://lore.kernel.org/all/de1b20e5-ded1-0aae-2221-f5d470d91015@nvidia.com/]

On Fri, Mar 10, 2023 at 02:39:19PM -0800, Tushar Dave wrote:
> On 3/9/23 09:53, Bjorn Helgaas wrote:
> > On Wed, Mar 08, 2023 at 07:34:58PM -0800, Tushar Dave wrote:
> > > On 3/7/23 03:59, Sagi Grimberg wrote:
> > > > On 3/2/23 02:09, Tushar Dave wrote:
> > > > > We are observing an NVMe device being disabled due to a reset failure
> > > > > after injecting a Malformed TLP. DPC/AER recovery succeeds but the NVMe
> > > > > device fails. I tried this on two different systems and it is 100%
> > > > > reproducible with the 6.2 kernel.
> > > > > 
> > > > > On my system, a Samsung NVMe SSD Controller PM173X sits directly behind
> > > > > a Broadcom PCIe switch downstream port.
> > > > > The Malformed TLP is injected by changing the Max Payload Size (MPS) of
> > > > > the PCIe switch to 128B (keeping the NVMe device MPS at 512B).
> > > > > 
> > > > > e.g.
> > > > > # change MPS of PCIe switch (a9:10.0)
> > > > > $ setpci -v -s a9:10.0 cap_exp+0x8.w
> > > > > 0000:a9:10.0 (cap 10 @68) @70 = 0857
> > > > > $ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
> > > > > 0000:a9:10.0 (cap 10 @68) @70 0817
> > > > > $ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
> > > > >           DevCtl:    CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
> > > > >               RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> > > > >               MaxPayload 128 bytes, MaxReadReq 128 bytes
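For reference, that write only touches the Max Payload Size field (Device
Control register bits 7:5); a minimal decode sketch, assuming the standard
PCIe encoding of 128B << field value:

  /*
   * Minimal sketch (not from the original mail): decode the MPS field of the
   * PCIe Device Control register to show why the 0x0857 -> 0x0817 write
   * drops the switch downstream port from 512B to 128B MPS.
   */
  #include <stdio.h>

  static unsigned int devctl_mps_bytes(unsigned int devctl)
  {
          return 128u << ((devctl >> 5) & 0x7);   /* bits 7:5 */
  }

  int main(void)
  {
          printf("DevCtl 0x0857 -> MPS %u bytes\n", devctl_mps_bytes(0x0857)); /* 512 */
          printf("DevCtl 0x0817 -> MPS %u bytes\n", devctl_mps_bytes(0x0817)); /* 128 */
          return 0;
  }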
> > > > > 
> > > > > # run some traffic on nvme (ab:00.0)
> > > > > $ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
> > > > > dd: error reading '/dev/nvme0n1': Input/output error
> > > > > 2+0 records in
> > > > > 2+0 records out
> > > > > 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
> > > > > 
> > > > > #kernel log:
> > > > > [  163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
> > > > > [  163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> > > > > [  163.049071] pcieport 0000:a9:10.0: DPC: containment event,
> > > > > status:0x2009 source:0x0000
> > > > > [  163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
> > > > > detected
> > > > > [  163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
> > > > > severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> > > > > [  163.078151] pcieport 0000:a9:10.0:   device [1000:c030] error
> > > > > status/mask=00040000/00180000
> > > > > [  163.087613] pcieport 0000:a9:10.0:    [18] MalfTLP
> > > > > (First)
> > > > > [  163.095281] pcieport 0000:a9:10.0: AER:   TLP Header: 60000080
> > > > > ab0000ff 00000001 d1fd0000
> > > > > [  163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> > > > > [  163.112095] nvme nvme0: frozen state error detected, reset controller
> > > > > [  163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
> > > > > (sct 0x3 / sc 0x71)
> > > > > [  163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
> > > > > 0x4080700 phys_seg 4 prio class 2
> > > > > [  163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> > > > > [  163.390895] nvme nvme0: restart after slot reset
> > > > > [  163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
> > > > > (was 0x100, writing 0x1ff)
> > > > > [  163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
> > > > > (was 0x0, writing 0xe0600000)
> > > > > [  163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
> > > > > (was 0x4, writing 0xe0710004)
> > > > > [  163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
> > > > > (was 0x0, writing 0x8)
> > > > > [  163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
> > > > > (was 0x100000, writing 0x100546)
> > > > > [  163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
> > > > > [  163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
> > > > > (reading 0xa824144d)
> > > > > [  163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
> > > > > (reading 0x100546)
> > > > > [  163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
> > > > > (reading 0x1080200)
> > > > > [  163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
> > > > > (reading 0x8)
> > > > > [  163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
> > > > > (reading 0xe0710004)
> > > > > [  163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
> > > > > (reading 0x0)
> > > > > [  163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
> > > > > (reading 0x0)
> > > > > [  163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
> > > > > (reading 0x0)
> > > > > [  163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
> > > > > (reading 0x0)
> > > > > [  163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
> > > > > (reading 0x0)
> > > > > [  163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
> > > > > (reading 0x0)
> > > > > [  163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
> > > > > (reading 0xa80a144d)
> > > > > [  163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
> > > > > (reading 0xe0600000)
> > > > > [  163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
> > > > > (reading 0x40)
> > > > > [  163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
> > > > > (reading 0x0)
> > > > > [  163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
> > > > > (reading 0x1ff)
> > > > > [  210.089132] block nvme0n1: no usable path - requeuing I/O
> > > > > [  223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> > > > > [  223.825236] nvme nvme0: Identify Controller failed (-4)
> > > > > [  223.832145] nvme nvme0: Disabling device after reset failure: -5
> > > > 
> > > > At this point the device is not going to recover.
> > >
> > > Yes, I agree.
> > > 
> > > I looked a little more and found that the nvme reset failure and the
> > > second DPC were both due to nvme_slot_reset() restoring the MPS as part
> > > of pci_restore_state().
> > > 
> > > AFAICT, after the first DPC event occurs, the nvme device MPS gets changed
> > > to the _default_ value of 128B (likely due to DPC link retraining). However,
> > > as part of software AER recovery, nvme_slot_reset() restores the device
> > > state, and that brings the nvme device MPS back to 512B. (The MPS of the
> > > PCIe switch a9:10.0 still remains at 128B.)
> > > 
> > > At this point, when nvme_reset_ctrl->nvme_reset_work() tries to enable the
> > > device, a Malformed TLP is generated again, which causes a second DPC event
> > > and makes the NVMe controller reset fail as well.
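For reference, the slot_reset path in question looks roughly like the sketch
below in a v6.2-era kernel (paraphrased, not an exact copy of
drivers/nvme/host/pci.c); the restored config space includes the saved Device
Control register, so the stale 512B MPS comes back just before the controller
reset is scheduled:

  static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
  {
          struct nvme_dev *dev = pci_get_drvdata(pdev);

          dev_info(dev->ctrl.device, "restart after slot reset\n");
          pci_restore_state(pdev);        /* rewrites DEVCTL, incl. MPS=512B */
          nvme_reset_ctrl(&dev->ctrl);    /* -> nvme_reset_work() re-enables I/O */
          return PCI_ERS_RESULT_RECOVERED;
  }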
> > 
> > This sounds like the behavior I expect.  IIUC:
> > 
> >    - Switch and NVMe MPS are 512B
> >    - NVMe config space saved (including MPS=512B)
> >    - You change Switch MPS to 128B
> >    - NVMe does DMA with payload > 128B
> >    - Switch reports Malformed TLP because TLP is larger than its MPS
> >    - Recovery resets NVMe, which sets MPS to the default of 128B
> >    - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
> >    - Subsequent NVMe DMA with payload > 128B repeats cycle
> > 
> > What do you think *should* be happening here?  I don't see a PCI
> > problem here.  If you change MPS on the Switch without coordinating
> > with NVMe, things aren't going to work.  Or am I missing something?
> 
> I agree this is expected, but there are instances where I do _not_ see the
> issue occurring. That is due to the involvement of pciehp, which adds and
> configures the nvme device (it coordinates the MPS with the PCIe switch, and
> the new MPS gets saved too, so future tests of this kind won't reproduce the
> issue; that much is understood).
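For reference, a rough illustration of what that re-enumeration buys you,
assuming the usual MPS alignment the PCI core performs at probe time (compare
pci_configure_mps() in drivers/pci/probe.c and the "Max Payload Size set to
128 (was 512, max 512)" line in the log below; the helper here is
illustrative, not the exact kernel code):

  static void align_mps_with_parent(struct pci_dev *dev)
  {
          struct pci_dev *bridge = pci_upstream_bridge(dev);

          if (!bridge)
                  return;

          /* Bring the endpoint's MPS in line with its upstream switch port,
           * so the mismatch that produced the Malformed TLP goes away. */
          if (pcie_get_mps(dev) != pcie_get_mps(bridge))
                  pcie_set_mps(dev, pcie_get_mps(bridge)); /* e.g. 512B -> 128B */
  }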
> 
> IMO though, the result of the test should be consistent.
> Either pciehp/DPC should take care of device recovery 100% of the time,
> or we consider the nvme recovery failure an expected result because the MPS
> of the PCIe switch got changed without coordinating with the nvme device.
> 
> What do you think?

In the log below, pciehp obviously is enabled; should I infer that in
the log above, it is not?

Generally we've avoided handling a device reset as a remove/add event
because upper layers can't deal well with that.  But in the log below
it looks like pciehp *did* treat the DPC containment as a remove/add,
which of course involves configuring the "new" device and its MPS
settings.

  [  217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
  [  217.071217] nvme nvme0: restart after slot reset
  [  217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up ignored (recovered by DPC)
  [  217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active: lnk_status = 2044
  [  217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
  [  217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:ab:00
  [  217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
  [  217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state, DEVCTL: 0x5957

The .slot_reset() method (nvme_slot_reset()) is called *after* the
device has been reset, and the device is supposed to be ready for the
driver to use it.  But here it looks like pciehp thinks ab:00.0 is not
present, so it removes it.  Later ab:00.0 is present again, so we
re-enumerate it:

  [  217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
  [  217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
  [  217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status: lnk_status = 2044
  [  217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
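For context, these are the driver-side hooks involved in that contract; a
rough, trimmed sketch of how the nvme driver wires them up (see struct
pci_error_handlers in include/linux/pci.h; the exact field set varies by
kernel version):

  /* AER/DPC recovery broadcasts error_detected() first, then slot_reset()
   * once the link/device has been reset, then resume(); .slot_reset() is
   * expected to find a device that is ready for the driver to use. */
  static const struct pci_error_handlers nvme_err_handler = {
          .error_detected = nvme_error_detected, /* "frozen state error detected" */
          .slot_reset     = nvme_slot_reset,     /* "restart after slot reset" */
          .resume         = nvme_error_resume,
  };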

What kernel are you testing?  53b54ad074de ("PCI/DPC: Await readiness
of secondary bus after reset") looks like it could be related, but
you'd have to be using v6.3-rc1 or later to get it.

> e.g. [ when pciehp takes care of things]
> 
> [  216.608538] pcieport 0000:a9:10.0: pciehp: pending interrupts 0x0108 from
> Slot Status
> [  216.639954] pcieport 0000:a5:01.0: EDR: EDR event received
> [  216.640429] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [  216.640438] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009
> source:0x0000
> [  216.640442] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
> [  216.640452] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected
> (Fatal), type=Transaction Layer, (Receiver ID)
> [  216.652549] pcieport 0000:a9:10.0:   device [1000:c030] error
> status/mask=00040000/00180000
> [  216.661975] pcieport 0000:a9:10.0:    [18] MalfTLP                (First)
> [  216.669647] pcieport 0000:a9:10.0: AER:   TLP Header: 60000080 ab0000ff
> 00000102 276fe000
> [  216.678890] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [  216.678898] nvme nvme0: frozen state error detected, reset controller
> [  216.842570] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error (sct
> 0x3 / sc 0x71)
> [  216.851684] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
> 0x4080700 phys_seg 4 prio class 2
> [  217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [  217.071217] nvme nvme0: restart after slot reset
> [  217.071228] nvme 0000:ab:00.0: nvme_slot_reset: before pci_restore_state
> DEVCTL: 0x2910
> [  217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up
> ignored (recovered by DPC)
> [  217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
> lnk_status = 2044
> [  217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
> [  217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device:
> domain:bus:dev = 0000:ab:00
> [  217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was
> 0x100, writing 0x1ff)
> [  217.071346] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was
> 0x0, writing 0xe0600000)
> [  217.071373] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was
> 0x4, writing 0xe0710004)
> [  217.071383] nvme 0000:ab:00.0: restoring config space at offset 0xc (was
> 0x0, writing 0x8)
> [  217.071394] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was
> 0x100000, writing 0x100546)
> [  217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state,
> DEVCTL: 0x5957
> [  217.071464] pcieport 0000:a9:10.0: AER: broadcast resume message
> [  217.071467] nvme 0000:ab:00.0: PME# disabled
> [  217.071513] pcieport 0000:a9:10.0: AER: device recovery successful
> [  217.071522] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [  217.071526] nvme 0000:ab:00.0: vgaarb: pci_notify
> [  217.071531] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
> [  217.071614] nvme nvme0: ctrl state 6 is not RESETTING
> [  217.103486] Buffer I/O error on dev nvme0n1, logical block 2, async page read
> [  217.308778] pci 0000:ab:00.0: vgaarb: pci_notify
> [  217.308831] pci 0000:ab:00.0: vgaarb: pci_notify
> [  217.311299] pci 0000:ab:00.0: vgaarb: pci_notify
> [  217.311863] pci 0000:ab:00.0: device released
> [  217.311887] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
> lnk_status = 2044
> [  217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
> [  217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
> [  217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status:
> lnk_status = 2044
> [  217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
> [  217.455275] pci 0000:ab:00.0: reg 0x10: [mem 0xe0710000-0xe0717fff 64bit]
> [  217.455362] pci 0000:ab:00.0: reg 0x30: [mem 0xe0600000-0xe060ffff pref]
> [  217.455380] pci 0000:ab:00.0: Max Payload Size set to 128 (was 512, max 512)
> [  217.455726] pci 0000:ab:00.0: reg 0x20c: [mem 0xe0610000-0xe0617fff 64bit]
> [  217.455732] pci 0000:ab:00.0: VF(n) BAR0 space: [mem
> 0xe0610000-0xe070ffff 64bit] (contains BAR0 for 32 VFs)
> [  217.456307] pci 0000:ab:00.0: vgaarb: pci_notify
> [  217.456404] pcieport 0000:a9:10.0: bridge window [io  0x1000-0x0fff] to
> [bus ab] add_size 1000
> [  217.456413] pcieport 0000:a9:10.0: bridge window [mem
> 0x00100000-0x000fffff 64bit pref] to [bus ab] add_size 200000 add_align
> 100000
> [  217.456430] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
> 0x00200000 64bit pref]
> [  217.456436] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
> 0x00200000 64bit pref]
> [  217.456440] pcieport 0000:a9:10.0: BAR 13: no space for [io  size 0x1000]
> [  217.456444] pcieport 0000:a9:10.0: BAR 13: failed to assign [io  size 0x1000]
> [  217.456451] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
> 0x00200000 64bit pref]
> [  217.456457] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
> 0x00200000 64bit pref]
> [  217.456464] pcieport 0000:a9:10.0: BAR 13: no space for [io  size 0x1000]
> [  217.456470] pcieport 0000:a9:10.0: BAR 13: failed to assign [io  size 0x1000]
> [  217.456480] pci 0000:ab:00.0: BAR 6: assigned [mem 0xe0600000-0xe060ffff pref]
> [  217.456488] pci 0000:ab:00.0: BAR 0: assigned [mem 0xe0610000-0xe0617fff 64bit]
> [  217.456509] pci 0000:ab:00.0: BAR 7: assigned [mem 0xe0618000-0xe0717fff 64bit]
> [  217.456517] pcieport 0000:a9:10.0: PCI bridge to [bus ab]
> [  217.456526] pcieport 0000:a9:10.0:   bridge window [mem 0xe0600000-0xe07fffff]
> [  217.456614] nvme 0000:ab:00.0: vgaarb: pci_notify
> [  217.456624] nvme 0000:ab:00.0: runtime IRQ mapping not provided by arch
> [  217.457452] nvme nvme10: pci function 0000:ab:00.0
> [  217.458154] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading
> 0xa824144d)
> [  217.458166] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading
> 0x100546)
> [  217.458173] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading
> 0x1080200)
> [  217.458179] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
> [  217.458185] nvme 0000:ab:00.0: saving config space at offset 0x10
> (reading 0xe0610004)
> [  217.458192] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
> [  217.458198] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
> [  217.458202] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
> [  217.458207] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
> [  217.458212] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
> [  217.458217] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
> [  217.458222] nvme 0000:ab:00.0: saving config space at offset 0x2c
> (reading 0xa80a144d)
> [  217.458227] nvme 0000:ab:00.0: saving config space at offset 0x30
> (reading 0xe0600000)
> [  217.458237] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
> [  217.458242] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
> [  217.458247] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
> [  217.462192] nvme nvme10: Shutdown timeout set to 10 seconds
> [  217.520625] nvme nvme10: 63/0/0 default/read/poll queues


