nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
Sagi Grimberg
sagi at grimberg.me
Tue Mar 7 03:59:59 PST 2023
On 3/2/23 02:09, Tushar Dave wrote:
> Hi,
>
> We are observing the NVMe device being disabled due to a reset failure
> after injecting a Malformed TLP. DPC/AER recovery succeeds but the NVMe
> reset fails. I tried this on 2 different systems and it is 100%
> reproducible with the 6.2 kernel.
>
> On my system, Samsung NVMe SSD Controller PM173X is directly behind the
> Broadcom PCIe Switch Downstream Port.
> The Malformed TLP is injected by changing the Max Payload Size (MPS) of
> the PCIe switch port to 128B (keeping the NVMe device MPS at 512B).
>
> e.g.
> # change MPS of PCIe switch (a9:10.0)
> $ setpci -v -s a9:10.0 cap_exp+0x8.w
> 0000:a9:10.0 (cap 10 @68) @70 = 0857
> $ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
> 0000:a9:10.0 (cap 10 @68) @70 0817
> $ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 128 bytes
>
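As a side note, the two DevCtl values shown above can be decoded by hand.
The following is only a minimal Python sketch, assuming the standard PCIe
Device Control layout (Max_Payload_Size in bits 7:5, Max_Read_Request_Size
in bits 14:12, both encoded as 128 << n). The helper name decode_devctl is
made up for illustration and is not part of setpci or lspci.

```python
# Decode the size fields of the PCIe Device Control register
# (offset 0x8 in the PCI Express capability).
# decode_devctl is a hypothetical helper, used only for illustration.

def decode_devctl(value: int) -> dict:
    mps_enc = (value >> 5) & 0x7     # Max_Payload_Size, bits 7:5
    mrrs_enc = (value >> 12) & 0x7   # Max_Read_Request_Size, bits 14:12
    return {
        "max_payload": 128 << mps_enc,    # bytes
        "max_read_req": 128 << mrrs_enc,  # bytes
    }

before = decode_devctl(0x0857)  # value read back before the write
after = decode_devctl(0x0817)   # value written with setpci

print("before:", before)
print("after: ", after)
```

Under these assumptions, 0x0857 decodes to a 512-byte MPS and 0x0817 to a
128-byte MPS, which agrees with the lspci output quoted above.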
> # run some traffic on nvme (ab:00.0)
> $ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
> dd: error reading '/dev/nvme0n1': Input/output error
> 2+0 records in
> 2+0 records out
> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
>
> #kernel log:
> [ 163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
> [ 163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [ 163.049071] pcieport 0000:a9:10.0: DPC: containment event,
> status:0x2009 source:0x0000
> [ 163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
> detected
> [ 163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
> severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [ 163.078151] pcieport 0000:a9:10.0: device [1000:c030] error
> status/mask=00040000/00180000
> [ 163.087613] pcieport 0000:a9:10.0: [18] MalfTLP
> (First)
> [ 163.095281] pcieport 0000:a9:10.0: AER: TLP Header: 60000080
> ab0000ff 00000001 d1fd0000
> [ 163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [ 163.112095] nvme nvme0: frozen state error detected, reset controller
> [ 163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
> (sct 0x3 / sc 0x71)
> [ 163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
> 0x4080700 phys_seg 4 prio class 2
> [ 163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 163.390895] nvme nvme0: restart after slot reset
> [ 163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
> (was 0x100, writing 0x1ff)
> [ 163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
> (was 0x0, writing 0xe0600000)
> [ 163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
> (was 0x4, writing 0xe0710004)
> [ 163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
> (was 0x0, writing 0x8)
> [ 163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
> (was 0x100000, writing 0x100546)
> [ 163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
> [ 163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
> (reading 0xa824144d)
> [ 163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
> (reading 0x100546)
> [ 163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
> (reading 0x1080200)
> [ 163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
> (reading 0x8)
> [ 163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
> (reading 0xe0710004)
> [ 163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
> (reading 0x0)
> [ 163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
> (reading 0x0)
> [ 163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
> (reading 0x0)
> [ 163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
> (reading 0x0)
> [ 163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
> (reading 0x0)
> [ 163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
> (reading 0x0)
> [ 163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
> (reading 0xa80a144d)
> [ 163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
> (reading 0xe0600000)
> [ 163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
> (reading 0x40)
> [ 163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
> (reading 0x0)
> [ 163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
> (reading 0x1ff)
> [ 210.089132] block nvme0n1: no usable path - requeuing I/O
> [ 223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> [ 223.825236] nvme nvme0: Identify Controller failed (-4)
> [ 223.832145] nvme nvme0: Disabling device after reset failure: -5
At this point the device is not going to recover.
> [ 223.876833] Buffer I/O error on dev nvme0n1, logical block 2, async
> page read
> [ 223.876939] pcieport 0000:a9:10.0: AER: device recovery successful
> [ 223.893404] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [ 223.901469] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
> [ 223.938902] pcieport 0000:a5:01.0: EDR: EDR event received
> [ 223.946077] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [ 223.953901] pcieport 0000:a9:10.0: DPC: containment event,
> status:0x2009 source:0x0000
> [ 223.963243] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
> detected
> [ 223.971691] pcieport 0000:a9:10.0: PCIe Bus Error:
> severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [ 223.984144] pcieport 0000:a9:10.0: device [1000:c030] error
> status/mask=00040000/00180000
> [ 223.993966] pcieport 0000:a9:10.0: [18] MalfTLP
> (First)
> [ 224.002023] pcieport 0000:a9:10.0: AER: TLP Header: 60000080
> ab0000ff 00000001 1e659000
> [ 224.011644] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [ 224.019604] nvme nvme0: frozen state error detected, reset controller
> [ 224.236597] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 224.244676] nvme nvme0: restart after slot reset
> [ 224.250584] nvme 0000:ab:00.0: restoring config space at offset 0x3c
> (was 0x100, writing 0x1ff)
> [ 224.260945] nvme 0000:ab:00.0: restoring config space at offset 0x30
> (was 0x0, writing 0xe0600000)
> [ 224.271460] nvme 0000:ab:00.0: restoring config space at offset 0x10
> (was 0x4, writing 0xe0710004)
> [ 224.282012] nvme 0000:ab:00.0: restoring config space at offset 0xc
> (was 0x0, writing 0x8)
> [ 224.291713] nvme 0000:ab:00.0: restoring config space at offset 0x4
> (was 0x100000, writing 0x100546)
> [ 224.302430] pcieport 0000:a9:10.0: AER: broadcast resume message
> [ 224.309618] pcieport 0000:a9:10.0: AER: device recovery successful
> [ 224.316968] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [ 224.324865] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>
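For reference, the TLP header words logged by AER can be decoded to see
which request was flagged. This is a minimal Python sketch, assuming the
four logged words are header DW0..DW3 in order and that the TLP is a
memory request (Type 0); the function name decode_mem_tlp_header is
hypothetical. It does not handle ARI or any other TLP type.

```python
# Decode a PCIe memory request TLP header from four 32-bit words as
# printed in the "AER: TLP Header:" log line.
# decode_mem_tlp_header is a hypothetical helper for illustration.

def decode_mem_tlp_header(dwords):
    dw0, dw1, dw2, dw3 = dwords
    fmt = (dw0 >> 29) & 0x7        # Fmt[2:0]
    tlp_type = (dw0 >> 24) & 0x1F  # Type[4:0]
    if fmt > 3 or tlp_type != 0:
        raise ValueError("not a plain memory request header")

    has_data = bool(fmt & 0x2)     # set for writes (header plus payload)
    four_dw = bool(fmt & 0x1)      # set for 64-bit addressing
    length_dw = (dw0 & 0x3FF) or 1024  # Length[9:0], 0 encodes 1024 DW

    requester_id = (dw1 >> 16) & 0xFFFF
    bdf = "%02x:%02x.%x" % (requester_id >> 8,
                            (requester_id >> 3) & 0x1F,
                            requester_id & 0x7)

    if four_dw:
        address = (dw2 << 32) | (dw3 & 0xFFFFFFFC)
    else:
        address = dw2 & 0xFFFFFFFC

    return {
        "kind": "MWr" if has_data else "MRd",
        "length_bytes": length_dw * 4,
        "requester": bdf,
        "address": address,
    }

first = decode_mem_tlp_header([0x60000080, 0xab0000ff, 0x00000001, 0xd1fd0000])
second = decode_mem_tlp_header([0x60000080, 0xab0000ff, 0x00000001, 0x1e659000])
print(first)
print(second)
```

Read this way, both logged headers look like 512-byte memory writes from
requester ab:00.0, which would be consistent with the NVMe device still
using a 512B MPS while the switch port is limited to 128B.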
> After the test, the NVMe device still shows up in lspci and I can read
> its PCIe config space, but I cannot read from or write to the NVMe
> controller.
>
> I am trying to narrow it down but would appreciate help from linux-nvme.
The issue is that during the reset the driver failed to identify the
controller (the Identify Controller command on the admin queue timed
out), due to a second failure. The nvme driver does not know how to
handle that.
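
To help read the numeric codes in the log above, here is a small Python
sketch that maps them to names. It assumes the negative values are
ordinary Linux errno values and that the "sct/sc" pair follows the NVMe
status code tables; the lookup tables below are deliberately partial and
the helper names are made up for illustration.

```python
import errno

# Partial lookup tables; only the entries needed for this log are listed.
NVME_SCT = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x3: "Path Related Status",
}
NVME_PATH_SC = {
    0x70: "Host Pathing Error",
    0x71: "Command Aborted By Host",
}

def kernel_errno_name(ret: int) -> str:
    """Map a negative kernel return value such as -5 to its errno name."""
    return errno.errorcode.get(-ret, "unknown")

def decode_nvme_status(sct: int, sc: int) -> tuple:
    """Return (status code type, status code) names where known."""
    sct_name = NVME_SCT.get(sct, "unknown")
    sc_name = NVME_PATH_SC.get(sc, "unknown") if sct == 0x3 else "unknown"
    return sct_name, sc_name

print(kernel_errno_name(-4))           # "Identify Controller failed (-4)"
print(kernel_errno_name(-5))           # "reset failure: -5"
print(decode_nvme_status(0x3, 0x71))   # "sct 0x3 / sc 0x71"
```

With these tables, -4 is EINTR, -5 is EIO, and sct 0x3 / sc 0x71 reads as
a path-related "Command Aborted By Host" status, i.e. the read was
aborted on the host side rather than failed by the device.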