[PATCH v2 4/4] PCI/AER: Dont do recovery when DPC is enabled
Sinan Kaya
okaya at codeaurora.org
Thu Nov 16 12:52:47 PST 2017
>>
>> Whether the AER driver reads ~0 or not really depends on timing. The
>> link may come up from the DPC driver by the time AER driver reaches
>> here as an example.
>>
>> Bad things do happen. We have seen this with e1000e driver.
>
> I don't doubt that bad things happen. I'm just trying to understand
> exactly *what* bad things happen and how, so we can fix them cleanly.
>
This was random crashes in the e1000e drivers accompanied with stack
traces coming from WARN and msi allocation routines.
> I don't know exactly what you mean by "DPC stops the drivers
> immediately". Since the DPC hardware disables the Link, I *think*
> you probably mean that driver accesses to the device start failing
> (whether the driver notices this is a whole different question).
Sorry for not being clear.
I was talking about the pci_stop_and_remove_bus_device() call in DPC's
interrupt_event_handler() function as the "stop command".
>
> When the DPC hardware disables the Link, it causes a hot reset for
> downstream components. The DPC interrupt_event_handler() doesn't do
> much except remove the device (which detaches the driver) and clear
> the DPC Trigger Status bit (which allows hardware to try to retrain
> the Link).
>
That's true.
> So the "stop" and "recover" commands you mention must be related to
> AER.
I was talking about pci_stop_and_remove_bus_device() vs. error_detected()
as "stop" and "recover"
> I guess these would be some of the driver callbacks
> (error_detected(), mmio_enabled(), slot_reset(), reset_prepare(),
> reset_done(), resume())?
>
> In any case, I agree that it probably doesn't make sense to call any
> of these callbacks if the DPC driver has already detached the driver
> and re-attached it. The device state is gone because of the hot reset
> and the driver state is gone because of the detach/re-attach.
>
> However, I'm not so sure about the period *before* the DPC driver
> detaches the driver. The description of error_detected() says it
> cannot assume the device is accessible, so I think there might be an
> argument that AER *should* call this for DPC events so the driver has
> a chance to clean up before being unceremoniously detached.
Makes sense. Let us work on this.
>
> I suspect this all probably requires tighter integration between DPC
> and AER, and I'm totally fine with that. I think the current
> separation as separate "drivers" is pretty artificial anyway.
Got it. We will try to plumb DPC error handling into AER driver's error
handling mechanism.
Another question:
What do you think about the rescan following link up? The only entity
that does rescan today is hotplug after DPC recovery. There could be
a platform with DPC support but no hotplug support.
How should we handle it?
We can call rescan from DPC all the time.
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
More information about the linux-arm-kernel
mailing list