nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery

Tushar Dave tdave at nvidia.com
Fri Mar 10 17:45:48 PST 2023



On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> [+cc Lukas, beginning of thread:
> https://lore.kernel.org/all/de1b20e5-ded1-0aae-2221-f5d470d91015@nvidia.com/]
> 
> On Fri, Mar 10, 2023 at 02:39:19PM -0800, Tushar Dave wrote:
>> On 3/9/23 09:53, Bjorn Helgaas wrote:
>>> On Wed, Mar 08, 2023 at 07:34:58PM -0800, Tushar Dave wrote:
>>>> On 3/7/23 03:59, Sagi Grimberg wrote:
>>>>> On 3/2/23 02:09, Tushar Dave wrote:
>>>>>> We are observing the NVMe device being disabled due to a reset failure
>>>>>> after injecting a Malformed TLP. DPC/AER recovery succeeds, but NVMe
>>>>>> fails. I tried this on 2 different systems and it is 100% reproducible
>>>>>> with the 6.2 kernel.
>>>>>>
>>>>>> On my system, a Samsung NVMe SSD Controller PM173X is directly behind a
>>>>>> Broadcom PCIe Switch Downstream Port.
>>>>>> The Malformed TLP is injected by changing the Max Payload Size (MPS) of
>>>>>> the PCIe switch to 128B (keeping the NVMe device MPS at 512B).
>>>>>>
>>>>>> e.g.
>>>>>> # change MPS of PCIe switch (a9:10.0)
>>>>>> $ setpci -v -s a9:10.0 cap_exp+0x8.w
>>>>>> 0000:a9:10.0 (cap 10 @68) @70 = 0857
>>>>>> $ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
>>>>>> 0000:a9:10.0 (cap 10 @68) @70 0817
>>>>>> $ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
>>>>>>            DevCtl:    CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
>>>>>>                RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>>>>                MaxPayload 128 bytes, MaxReadReq 128 bytes
>>>>>>
>>>>>> # run some traffic on nvme (ab:00.0)
>>>>>> $ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
>>>>>> dd: error reading '/dev/nvme0n1': Input/output error
>>>>>> 2+0 records in
>>>>>> 2+0 records out
>>>>>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
>>>>>>
>>>>>> #kernel log:
>>>>>> [  163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
>>>>>> [  163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>>>>>> [  163.049071] pcieport 0000:a9:10.0: DPC: containment event,
>>>>>> status:0x2009 source:0x0000
>>>>>> [  163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
>>>>>> detected
>>>>>> [  163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
>>>>>> severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>>>>>> [  163.078151] pcieport 0000:a9:10.0:   device [1000:c030] error
>>>>>> status/mask=00040000/00180000
>>>>>> [  163.087613] pcieport 0000:a9:10.0:    [18] MalfTLP
>>>>>> (First)
>>>>>> [  163.095281] pcieport 0000:a9:10.0: AER:   TLP Header: 60000080
>>>>>> ab0000ff 00000001 d1fd0000
>>>>>> [  163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>>>>>> [  163.112095] nvme nvme0: frozen state error detected, reset controller
>>>>>> [  163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
>>>>>> (sct 0x3 / sc 0x71)
>>>>>> [  163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
>>>>>> 0x4080700 phys_seg 4 prio class 2
>>>>>> [  163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>>>>>> [  163.390895] nvme nvme0: restart after slot reset
>>>>>> [  163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
>>>>>> (was 0x100, writing 0x1ff)
>>>>>> [  163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
>>>>>> (was 0x0, writing 0xe0600000)
>>>>>> [  163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
>>>>>> (was 0x4, writing 0xe0710004)
>>>>>> [  163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
>>>>>> (was 0x0, writing 0x8)
>>>>>> [  163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
>>>>>> (was 0x100000, writing 0x100546)
>>>>>> [  163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
>>>>>> [  163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
>>>>>> (reading 0xa824144d)
>>>>>> [  163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
>>>>>> (reading 0x100546)
>>>>>> [  163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
>>>>>> (reading 0x1080200)
>>>>>> [  163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
>>>>>> (reading 0x8)
>>>>>> [  163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
>>>>>> (reading 0xe0710004)
>>>>>> [  163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
>>>>>> (reading 0x0)
>>>>>> [  163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
>>>>>> (reading 0x0)
>>>>>> [  163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
>>>>>> (reading 0x0)
>>>>>> [  163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
>>>>>> (reading 0x0)
>>>>>> [  163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
>>>>>> (reading 0x0)
>>>>>> [  163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
>>>>>> (reading 0x0)
>>>>>> [  163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
>>>>>> (reading 0xa80a144d)
>>>>>> [  163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
>>>>>> (reading 0xe0600000)
>>>>>> [  163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
>>>>>> (reading 0x40)
>>>>>> [  163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
>>>>>> (reading 0x0)
>>>>>> [  163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
>>>>>> (reading 0x1ff)
>>>>>> [  210.089132] block nvme0n1: no usable path - requeuing I/O
>>>>>> [  223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
>>>>>> [  223.825236] nvme nvme0: Identify Controller failed (-4)
>>>>>> [  223.832145] nvme nvme0: Disabling device after reset failure: -5
>>>>>
>>>>> At this point the device is not going to recover.
>>>>
>>>> Yes, I agree.
>>>>
>>>> I looked a little more and found that the nvme reset failure and the
>>>> second DPC were both due to nvme_slot_reset() restoring the MPS as part
>>>> of pci_restore_state().
>>>>
>>>> AFAICT, after the first DPC event occurs, the nvme device MPS gets changed
>>>> to the _default_ value of 128B (likely due to the DPC link retraining).
>>>> However, as part of software AER recovery, nvme_slot_reset() restores the
>>>> device state, which brings the nvme device MPS back to 512B. (The MPS of
>>>> PCIe switch a9:10.0 still remains at 128B.)
>>>>
>>>> At this point, when nvme_reset_ctrl->nvme_reset_work() tries to enable the
>>>> device, a Malformed TLP is generated again; that causes the second DPC and
>>>> makes the NVMe controller reset fail as well.
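
For context, the restore described above happens in the nvme .slot_reset()
callback. A paraphrased sketch of that path (not the exact v6.2 code, and with
the reset scheduling simplified away) looks roughly like this:

#include <linux/pci.h>

/*
 * Paraphrased sketch of the nvme .slot_reset() step that logs
 * "restart after slot reset" above; not the exact drivers/nvme/host/pci.c
 * code.  pci_restore_state() writes back the saved PCIe Device Control
 * register, which is what brings the endpoint MPS back to 512B while the
 * switch downstream port is still at 128B.
 */
static pci_ers_result_t nvme_slot_reset_sketch(struct pci_dev *pdev)
{
	dev_info(&pdev->dev, "restart after slot reset\n");

	/* restores saved config space, including DEVCTL (MPS back to 512B) */
	pci_restore_state(pdev);

	/*
	 * The real driver now schedules a controller reset; the resulting
	 * >128B DMA is what triggers the second Malformed TLP/DPC event.
	 */
	return PCI_ERS_RESULT_RECOVERED;
}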
>>>
>>> This sounds like the behavior I expect.  IIUC:
>>>
>>>     - Switch and NVMe MPS are 512B
>>>     - NVMe config space saved (including MPS=512B)
>>>     - You change Switch MPS to 128B
>>>     - NVMe does DMA with payload > 128B
>>>     - Switch reports Malformed TLP because TLP is larger than its MPS
>>>     - Recovery resets NVMe, which sets MPS to the default of 128B
>>>     - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
>>>     - Subsequent NVMe DMA with payload > 128B repeats cycle
>>>
>>> What do you think *should* be happening here?  I don't see a PCI
>>> problem here.  If you change MPS on the Switch without coordinating
>>> with NVMe, things aren't going to work.  Or am I missing something?
>>
>> I agree this is expected, but there are instances where I do _not_ see the
>> issue occurring. That is due to the involvement of pciehp, which adds and
>> configures the nvme device (it coordinates MPS with the pcie switch, and
>> the new MPS gets saved too, so future tests of this kind won't reproduce
>> the issue; that part is understood).
>>
>> IMO though, the result of the test should be consistent: either pciehp/DPC
>> should take care of device recovery 100% of the time, or we consider the
>> nvme recovery failure an expected result because the MPS of the pcie switch
>> was changed without coordinating with nvme.
>>
>> What do you think?
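
For reference, this is roughly the MPS coordination meant above: when pciehp
re-enumerates the device, probe brings the endpoint MPS in line with its
upstream port (cf. "Max Payload Size set to 128 (was 512, max 512)" in the log
below). A paraphrased sketch, loosely based on pci_configure_mps() in
drivers/pci/probe.c (details differ by kernel version and bus-config policy):

#include <linux/pci.h>

/*
 * Paraphrased sketch, loosely based on pci_configure_mps() in
 * drivers/pci/probe.c: during (re-)enumeration the endpoint MPS is
 * compared with the upstream bridge and adjusted to match, which is why
 * the pciehp re-add path ends up with a working 128B/128B configuration.
 */
static void configure_mps_sketch(struct pci_dev *dev)
{
	struct pci_dev *bridge = pci_upstream_bridge(dev);
	int mps, p_mps;

	if (!pci_is_pcie(dev) || !bridge || !pci_is_pcie(bridge))
		return;

	mps = pcie_get_mps(dev);	/* endpoint, e.g. 512B */
	p_mps = pcie_get_mps(bridge);	/* switch downstream port, e.g. 128B */

	if (mps == p_mps)
		return;

	/* bring the endpoint in line with its upstream port */
	if (pcie_set_mps(dev, p_mps) == 0)
		pci_info(dev, "Max Payload Size set to %d (was %d)\n",
			 p_mps, mps);
}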
> 
> In the log below, pciehp obviously is enabled; should I infer that in
> the log above, it is not?

pciehp is enabled all the time, in both the log above and the log below.
I do not yet have an answer as to why the pciehp activity (due to the DPC link
down/up) shows up only in some tests and not in others, as you noticed in both
logs.


> 
> Generally we've avoided handling a device reset as a remove/add event
> because upper layers can't deal well with that.  But in the log below
> it looks like pciehp *did* treat the DPC containment as a remove/add,
> which of course involves configuring the "new" device and its MPS
> settings.

Yes, and that puzzled me, especially given "Link Down/Up ignored (recovered by
DPC)". Do we still have a race somewhere? I am not sure.

> 
>    [  217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>    [  217.071217] nvme nvme0: restart after slot reset
>    [  217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up ignored (recovered by DPC)
>    [  217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active: lnk_status = 2044
>    [  217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
>    [  217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:ab:00
>    [  217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
>    [  217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state, DEVCTL: 0x5957
> 
> The .slot_reset() method (nvme_slot_reset()) is called *after* the
> device has been reset, and the device is supposed to be ready for the
> driver to use it.  But here it looks like pciehp thinks ab:00.0 is not
> present, so it removes it.  Later ab:00.0 is present again, so we
> re-enumerate it:

That is correct.
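
For anyone following along, the three "broadcast ..." messages in the log map
onto the driver's PCI error callbacks. A simplified sketch of that wiring
(paraphrased, not the exact drivers/nvme/host/pci.c code; the callback bodies
here are placeholders):

#include <linux/pci.h>

/* "frozen state error detected, reset controller" in the log */
static pci_ers_result_t sketch_error_detected(struct pci_dev *pdev,
					      pci_channel_state_t state)
{
	/* freeze queues and ask the core to reset the link/slot */
	return PCI_ERS_RESULT_NEED_RESET;
}

/* "restart after slot reset": device has been reset and should be usable */
static pci_ers_result_t sketch_slot_reset(struct pci_dev *pdev)
{
	pci_restore_state(pdev);	/* then schedule a controller reset */
	return PCI_ERS_RESULT_RECOVERED;
}

/* "broadcast resume message": recovery finished, unfreeze I/O */
static void sketch_resume(struct pci_dev *pdev)
{
}

static const struct pci_error_handlers sketch_err_handler = {
	.error_detected	= sketch_error_detected,
	.slot_reset	= sketch_slot_reset,
	.resume		= sketch_resume,
};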

> 
>    [  217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
>    [  217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
>    [  217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status: lnk_status = 2044
>    [  217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
> 
> What kernel are you testing?  53b54ad074de ("PCI/DPC: Await readiness
> of secondary bus after reset") looks like it could be related, but
> you'd have to be using v6.3-rc1 or later to get it.

I am on v6.2, but I will give v6.3-rc1 a try and get back to you.

> 
>> e.g. [ when pciehp takes care of things]
>>
>> [  216.608538] pcieport 0000:a9:10.0: pciehp: pending interrupts 0x0108 from
>> Slot Status
>> [  216.639954] pcieport 0000:a5:01.0: EDR: EDR event received
>> [  216.640429] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>> [  216.640438] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009
>> source:0x0000
>> [  216.640442] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>> [  216.640452] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected
>> (Fatal), type=Transaction Layer, (Receiver ID)
>> [  216.652549] pcieport 0000:a9:10.0:   device [1000:c030] error
>> status/mask=00040000/00180000
>> [  216.661975] pcieport 0000:a9:10.0:    [18] MalfTLP                (First)
>> [  216.669647] pcieport 0000:a9:10.0: AER:   TLP Header: 60000080 ab0000ff
>> 00000102 276fe000
>> [  216.678890] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>> [  216.678898] nvme nvme0: frozen state error detected, reset controller
>> [  216.842570] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error (sct
>> 0x3 / sc 0x71)
>> [  216.851684] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
>> 0x4080700 phys_seg 4 prio class 2
>> [  217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>> [  217.071217] nvme nvme0: restart after slot reset
>> [  217.071228] nvme 0000:ab:00.0: nvme_slot_reset: before pci_restore_state
>> DEVCTL: 0x2910
>> [  217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up
>> ignored (recovered by DPC)
>> [  217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
>> lnk_status = 2044
>> [  217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
>> [  217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device:
>> domain:bus:dev = 0000:ab:00
>> [  217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was
>> 0x100, writing 0x1ff)
>> [  217.071346] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was
>> 0x0, writing 0xe0600000)
>> [  217.071373] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was
>> 0x4, writing 0xe0710004)
>> [  217.071383] nvme 0000:ab:00.0: restoring config space at offset 0xc (was
>> 0x0, writing 0x8)
>> [  217.071394] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was
>> 0x100000, writing 0x100546)
>> [  217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state,
>> DEVCTL: 0x5957
>> [  217.071464] pcieport 0000:a9:10.0: AER: broadcast resume message
>> [  217.071467] nvme 0000:ab:00.0: PME# disabled
>> [  217.071513] pcieport 0000:a9:10.0: AER: device recovery successful
>> [  217.071522] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>> [  217.071526] nvme 0000:ab:00.0: vgaarb: pci_notify
>> [  217.071531] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>> [  217.071614] nvme nvme0: ctrl state 6 is not RESETTING
>> [  217.103486] Buffer I/O error on dev nvme0n1, logical block 2, async page read
>> [  217.308778] pci 0000:ab:00.0: vgaarb: pci_notify
>> [  217.308831] pci 0000:ab:00.0: vgaarb: pci_notify
>> [  217.311299] pci 0000:ab:00.0: vgaarb: pci_notify
>> [  217.311863] pci 0000:ab:00.0: device released
>> [  217.311887] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
>> lnk_status = 2044
>> [  217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
>> [  217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
>> [  217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status:
>> lnk_status = 2044
>> [  217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
>> [  217.455275] pci 0000:ab:00.0: reg 0x10: [mem 0xe0710000-0xe0717fff 64bit]
>> [  217.455362] pci 0000:ab:00.0: reg 0x30: [mem 0xe0600000-0xe060ffff pref]
>> [  217.455380] pci 0000:ab:00.0: Max Payload Size set to 128 (was 512, max 512)
>> [  217.455726] pci 0000:ab:00.0: reg 0x20c: [mem 0xe0610000-0xe0617fff 64bit]
>> [  217.455732] pci 0000:ab:00.0: VF(n) BAR0 space: [mem
>> 0xe0610000-0xe070ffff 64bit] (contains BAR0 for 32 VFs)
>> [  217.456307] pci 0000:ab:00.0: vgaarb: pci_notify
>> [  217.456404] pcieport 0000:a9:10.0: bridge window [io  0x1000-0x0fff] to
>> [bus ab] add_size 1000
>> [  217.456413] pcieport 0000:a9:10.0: bridge window [mem
>> 0x00100000-0x000fffff 64bit pref] to [bus ab] add_size 200000 add_align
>> 100000
>> [  217.456430] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
>> 0x00200000 64bit pref]
>> [  217.456436] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
>> 0x00200000 64bit pref]
>> [  217.456440] pcieport 0000:a9:10.0: BAR 13: no space for [io  size 0x1000]
>> [  217.456444] pcieport 0000:a9:10.0: BAR 13: failed to assign [io  size 0x1000]
>> [  217.456451] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
>> 0x00200000 64bit pref]
>> [  217.456457] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
>> 0x00200000 64bit pref]
>> [  217.456464] pcieport 0000:a9:10.0: BAR 13: no space for [io  size 0x1000]
>> [  217.456470] pcieport 0000:a9:10.0: BAR 13: failed to assign [io  size 0x1000]
>> [  217.456480] pci 0000:ab:00.0: BAR 6: assigned [mem 0xe0600000-0xe060ffff pref]
>> [  217.456488] pci 0000:ab:00.0: BAR 0: assigned [mem 0xe0610000-0xe0617fff 64bit]
>> [  217.456509] pci 0000:ab:00.0: BAR 7: assigned [mem 0xe0618000-0xe0717fff 64bit]
>> [  217.456517] pcieport 0000:a9:10.0: PCI bridge to [bus ab]
>> [  217.456526] pcieport 0000:a9:10.0:   bridge window [mem 0xe0600000-0xe07fffff]
>> [  217.456614] nvme 0000:ab:00.0: vgaarb: pci_notify
>> [  217.456624] nvme 0000:ab:00.0: runtime IRQ mapping not provided by arch
>> [  217.457452] nvme nvme10: pci function 0000:ab:00.0
>> [  217.458154] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading
>> 0xa824144d)
>> [  217.458166] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading
>> 0x100546)
>> [  217.458173] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading
>> 0x1080200)
>> [  217.458179] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
>> [  217.458185] nvme 0000:ab:00.0: saving config space at offset 0x10
>> (reading 0xe0610004)
>> [  217.458192] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
>> [  217.458198] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
>> [  217.458202] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
>> [  217.458207] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
>> [  217.458212] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
>> [  217.458217] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
>> [  217.458222] nvme 0000:ab:00.0: saving config space at offset 0x2c
>> (reading 0xa80a144d)
>> [  217.458227] nvme 0000:ab:00.0: saving config space at offset 0x30
>> (reading 0xe0600000)
>> [  217.458237] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
>> [  217.458242] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
>> [  217.458247] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
>> [  217.462192] nvme nvme10: Shutdown timeout set to 10 seconds
>> [  217.520625] nvme nvme10: 63/0/0 default/read/poll queues


