nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery

Tushar Dave tdave at nvidia.com
Mon Mar 13 17:57:43 PDT 2023



On 3/11/23 00:22, Lukas Wunner wrote:
> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>> In the log below, pciehp obviously is enabled; should I infer that in
>>> the log above, it is not?
>>
>> pciehp is enabled all the time. In the log above and below.
>> I do not have answer yet why pciehp shows-up only in some tests (due to DPC
>> link down/up) and not in others like you noticed in both the logs.
> 
> Maybe some of the switch Downstream Ports are hotplug-capable and
> some are not?  (Check the Slot Implemented bit in the PCI Express
> Capabilities Register as well as the Hot-Plug Capable bit in the
> Slot Capabilities Register.)

All PCIe switch Downstream Ports where NVMe drives are connected are
hotplug-capable, except the one for the boot NVMe drive, which I never use
in my tests :)
For my test, the PCIe switch Downstream Port is a9:10.0, with NVMe endpoint
ab:00.0 connected below it.

# PCI Express Cap - Slot Implemented bit(8) is set to 1
root$ setpci -v -s a9:10.0 cap_exp+0x02.w
0000:a9:10.0 (cap 10 @68) @6a = 0162

# Slt Cap - Hot-Plug Capable bit(6) is set to 1
root$ setpci -v -s a9:10.0 cap_exp+0x14.L
0000:a9:10.0 (cap 10 @68) @7c = 08800ce0

# Slt Control
root$ setpci -v -s a9:10.0 cap_exp+0x18.w
0000:a9:10.0 (cap 10 @68) @80 = 1028

# Slt Status
root$ setpci -v -s a9:10.0 cap_exp+0x1A.w
0000:a9:10.0 (cap 10 @68) @82 = 0040

FWIW, after MPS is changed and the issue reproduces:
root$ setpci -v -s a9:10.0 cap_exp+0x014.L
0000:a9:10.0 (cap 10 @68) @7c = 08800ce0
root$ setpci -v -s a9:10.0 cap_exp+0x018.w
0000:a9:10.0 (cap 10 @68) @80 = 1028
root$ setpci -v -s a9:10.0 cap_exp+0x01A.w
0000:a9:10.0 (cap 10 @68) @82 = 0148
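
For reference, decoding the Slot Status values above against the PCIe spec
bit definitions (PCI_EXP_SLTSTA_* in include/uapi/linux/pci_regs.h):

# 0x0040 (before)          = PresDet+
# 0x0148 (after reproduce) = PresDetChanged+ PresDet+ DLLStateChanged+
# lspci decodes this directly as well, e.g.:
root$ lspci -s a9:10.0 -vv | grep -A1 SltSta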

> 
> 
>>> Generally we've avoided handling a device reset as a remove/add event
>>> because upper layers can't deal well with that.  But in the log below
>>> it looks like pciehp *did* treat the DPC containment as a remove/add,
>>> which of course involves configuring the "new" device and its MPS
>>> settings.
>>
>> yes and that puzzled me why? especially when "Link Down/Up ignored (recovered
>> by DPC)". Do we still have race somewhere, I am not sure.
> 
> You're seeing the expected behavior.  pciehp ignores DLLSC events
> caused by DPC, but then double-checks that DPC recovery succeeded.
> If it didn't, it would be a bug not to bring down the slot.
> So pciehp does exactly that.  See this code snippet in
> pciehp_ignore_dpc_link_change():
> 
> 	/*
> 	 * If the link is unexpectedly down after successful recovery,
> 	 * the corresponding link change may have been ignored above.
> 	 * Synthesize it to ensure that it is acted on.
> 	 */
> 	down_read_nested(&ctrl->reset_lock, ctrl->depth);
> 	if (!pciehp_check_link_active(ctrl))
> 		pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
> 	up_read(&ctrl->reset_lock);
> 
> So on hotplug-capable ports, pciehp is able to mop up the mess created
> by fiddling with the MPS settings behind the kernel's back.

That's the thing: even on hotplug-capable slots I do not see pciehp engage
_all_ the time. Sometimes pciehp gets involved and takes care of things (as
I mentioned in the previous thread), and other times there is no pciehp
engagement at all!
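
(If it helps to correlate the runs: a quick way to see whether pciehp acted
on a given event, and which slot maps to which port -- addresses below are
from my setup:)

root$ dmesg | grep -iE 'pciehp|DPC|AER'
root$ grep -H . /sys/bus/pci/slots/*/address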

> 
> We don't have that option on non-hotplug-capable ports.  If error
> recovery fails, we generally let the inaccessible devices remain
> in the system and user interaction is necessary to recover, either
> through a reboot or by manually removing and rescanning PCI devices
> via sysfs after reinstating sane MPS settings.

Sure, that is understood for non-hotplug-capable slots.
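
For the record, a rough (untested) sketch of that manual recovery on my
topology, with a9:10.0 as the port whose MPS was changed and ab:00.0 as the
inaccessible endpoint:

# put the Downstream Port MPS back to 512B (DevCtl bits 7:5 = 010b)
root$ setpci -s a9:10.0 cap_exp+0x08.w=0040:00e0
# then remove and rescan the dead endpoint via sysfs
root$ echo 1 > /sys/bus/pci/devices/0000:ab:00.0/remove
root$ echo 1 > /sys/bus/pci/rescan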

> 
> 
>>    - Switch and NVMe MPS are 512B
>>    - NVMe config space saved (including MPS=512B)
>>    - You change Switch MPS to 128B
>>    - NVMe does DMA with payload > 128B
>>    - Switch reports Malformed TLP because TLP is larger than its MPS
>>    - Recovery resets NVMe, which sets MPS to the default of 128B
>>    - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
>>    - Subsequent NVMe DMA with payload > 128B repeats cycle
> 
> Forgive my ignorance, but if MPS is restored to 512B by nvme_slot_reset(),
> shouldn't the communication with the device just work again from that
> point on?
>
> Thanks,
> 
> Lukas


