nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery

Tushar Dave tdave at nvidia.com
Wed Mar 1 16:09:28 PST 2023


Hi,

We are seeing an NVMe device get disabled due to a reset failure after injecting a Malformed TLP. DPC/AER recovery succeeds, but the NVMe controller reset fails.
I tried this on 2 different systems and it is 100% reproducible with the 6.2 kernel.

On my system, a Samsung NVMe SSD Controller PM173X is directly behind a Broadcom PCIe switch Downstream Port.
The Malformed TLP is injected by changing the Max Payload Size (MPS) of the PCIe switch port to 128B (while keeping the NVMe device MPS at 512B).

e.g.
# change MPS of PCIe switch (a9:10.0)
$ setpci -v -s a9:10.0 cap_exp+0x8.w
0000:a9:10.0 (cap 10 @68) @70 = 0857
$ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
0000:a9:10.0 (cap 10 @68) @70 0817
$ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 128 bytes
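
For reference, here is a minimal sketch of how the two Device Control values above decode. It assumes the standard PCIe layout, where Max_Payload_Size is bits 7:5, Max_Read_Request_Size is bits 14:12, and both encode 128 << n bytes. The helper names devctl_mps and devctl_mrrs are hypothetical and used only for illustration.

```shell
# Hypothetical helpers: decode size fields of a PCIe Device Control value.
# Max_Payload_Size is bits 7:5, Max_Read_Request_Size is bits 14:12;
# both fields encode a size of (128 << n) bytes.
devctl_mps() {
    echo $(( 128 << (($1 >> 5) & 0x7) ))
}

devctl_mrrs() {
    echo $(( 128 << (($1 >> 12) & 0x7) ))
}

devctl_mps 0x0857     # value read before the write
devctl_mps 0x0817     # value written by setpci
devctl_mrrs 0x0817    # read request size is unchanged by the write
```

With the values from this report, 0x0857 decodes to an MPS of 512 bytes and 0x0817 to 128 bytes, which agrees with the lspci output. The same helper should also work on a live register, for example devctl_mps 0x$(setpci -s ab:00.0 cap_exp+0x8.w), assuming setpci prints the bare hex value when run without -v.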

# run some traffic on nvme (ab:00.0)
$ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
dd: error reading '/dev/nvme0n1': Input/output error
2+0 records in
2+0 records out
8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s

# kernel log:
[  163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
[  163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
[  163.049071] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
[  163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
[  163.066081] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[  163.078151] pcieport 0000:a9:10.0:   device [1000:c030] error status/mask=00040000/00180000
[  163.087613] pcieport 0000:a9:10.0:    [18] MalfTLP                (First)
[  163.095281] pcieport 0000:a9:10.0: AER:   TLP Header: 60000080 ab0000ff 00000001 d1fd0000
[  163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
[  163.112095] nvme nvme0: frozen state error detected, reset controller
[  163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error (sct 0x3 / sc 0x71)
[  163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags 0x4080700 phys_seg 4 prio class 2
[  163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[  163.390895] nvme nvme0: restart after slot reset
[  163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
[  163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
[  163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
[  163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
[  163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
[  163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
[  163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading 0xa824144d)
[  163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading 0x100546)
[  163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading 0x1080200)
[  163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
[  163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading 0xe0710004)
[  163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
[  163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
[  163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
[  163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
[  163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
[  163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
[  163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading 0xa80a144d)
[  163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading 0xe0600000)
[  163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
[  163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
[  163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
[  210.089132] block nvme0n1: no usable path - requeuing I/O
[  223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
[  223.825236] nvme nvme0: Identify Controller failed (-4)
[  223.832145] nvme nvme0: Disabling device after reset failure: -5
[  223.876833] Buffer I/O error on dev nvme0n1, logical block 2, async page read
[  223.876939] pcieport 0000:a9:10.0: AER: device recovery successful
[  223.893404] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
[  223.901469] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
[  223.938902] pcieport 0000:a5:01.0: EDR: EDR event received
[  223.946077] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
[  223.953901] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
[  223.963243] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
[  223.971691] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[  223.984144] pcieport 0000:a9:10.0:   device [1000:c030] error status/mask=00040000/00180000
[  223.993966] pcieport 0000:a9:10.0:    [18] MalfTLP                (First)
[  224.002023] pcieport 0000:a9:10.0: AER:   TLP Header: 60000080 ab0000ff 00000001 1e659000
[  224.011644] pcieport 0000:a9:10.0: AER: broadcast error_detected message
[  224.019604] nvme nvme0: frozen state error detected, reset controller
[  224.236597] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[  224.244676] nvme nvme0: restart after slot reset
[  224.250584] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
[  224.260945] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
[  224.271460] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
[  224.282012] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
[  224.291713] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
[  224.302430] pcieport 0000:a9:10.0: AER: broadcast resume message
[  224.309618] pcieport 0000:a9:10.0: AER: device recovery successful
[  224.316968] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
[  224.324865] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
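
The TLP header logged with the MalfTLP error can be decoded to confirm what the switch rejected. The sketch below assumes the standard PCIe memory request header layout (Fmt in bits 31:29 and Type in bits 28:24 of the first DWORD, Length in DWORDs in bits 9:0, Requester ID in bits 31:16 of the second DWORD). The function name tlp_decode is hypothetical.

```shell
# Hypothetical helper: decode the first two DWORDs of a logged TLP header.
# Arguments are the hex DWORDs as printed by AER (without a 0x prefix).
tlp_decode() {
    dw0=$(( 0x$1 ))
    dw1=$(( 0x$2 ))
    fmt=$(( (dw0 >> 29) & 0x7 ))       # 3 = 4DW header with data
    ttype=$(( (dw0 >> 24) & 0x1f ))    # 0 = memory request
    len_dw=$(( dw0 & 0x3ff ))          # payload length in DWORDs
    if [ "$len_dw" -eq 0 ]; then
        len_dw=1024                    # a Length of 0 encodes 1024 DWORDs
    fi
    printf 'fmt=%d type=%d payload=%dB requester=%02x:%02x.%d\n' \
        "$fmt" "$ttype" $(( len_dw * 4 )) \
        $(( (dw1 >> 24) & 0xff )) $(( (dw1 >> 19) & 0x1f )) $(( (dw1 >> 16) & 0x7 ))
}

tlp_decode 60000080 ab0000ff
```

For the header in the log this gives fmt=3 type=0, that is a 64-bit memory write, with a 512-byte payload from requester ab:00.0. That is consistent with the NVMe device still sending 512B writes into a switch port whose MPS is now 128B, both on the first error and again after the slot reset.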

After the test, the NVMe device still shows up in lspci and I can read its PCIe config space, but I cannot read from or write to the NVMe controller.

I am trying to narrow this down but would appreciate help from linux-nvme.

Thanks.
-Tushar


