[PATCH v2 4/4] PCI/AER: Dont do recovery when DPC is enabled
Bjorn Helgaas
helgaas at kernel.org
Fri Nov 17 16:02:43 PST 2017
On Thu, Nov 16, 2017 at 03:52:47PM -0500, Sinan Kaya wrote:
> >>
> >> Whether the AER driver reads ~0 or not really depends on timing. The
> >> link may come up from the DPC driver by the time AER driver reaches
> >> here as an example.
> >>
> >> Bad things do happen. We have seen this with e1000e driver.
> >
> > I don't doubt that bad things happen. I'm just trying to understand
> > exactly *what* bad things happen and how, so we can fix them cleanly.
>
> This was random crashes in the e1000e drivers accompanied with stack
> traces coming from WARN and msi allocation routines.
I didn't look in detail, but I'm not sure there's sufficient locking
in the AER path to make it safe from concurrent device removal. I
suspect AER could be improved both with respect to handling ~0 data
and this potential concurrency issue.
> > So the "stop" and "recover" commands you mention must be related to
> > AER.
>
> I was talking about pci_stop_and_remove_bus_device() vs. error_detected()
> as "stop" and "recover"
Thanks for clearing that up!
> > I suspect this all probably requires tighter integration between DPC
> > and AER, and I'm totally fine with that. I think the current
> > separation as separate "drivers" is pretty artificial anyway.
>
> Got it. We will try to plumb DPC error handling into AER driver's error
> handling mechanism.
Looking at the AER code today, I noticed it already uses "DPC" in
another sense. I don't know what it stands for there (probably
"deferred" something), but I don't think it's "Downstream Port
Containment" :)
> What do you think about the rescan following link up? The only entity
> that does rescan today is hotplug after DPC recovery. There could be
> a platform with DPC support but no hotplug support.
>
> How should we handle it?
Good question. If your system does support both DPC and hotplug, I
assume the link comes back up after you clear DPC Trigger Status.
Does pciehp notice that "link up" event and add the device back?
So I think your question is whether the DPC code should explicitly do
a rescan so that if we don't have pciehp, we'll still automatically
rediscover the device. I dunno, maybe. Seems like a plausible idea
anyway.
Bjorn
More information about the linux-arm-kernel
mailing list