Why do we check for "link-up" in *_pcie_valid_device()?

Mon Jan 8 03:03:34 PST 2018

On Friday, 05.01.2018 at 15:43 +0000, Lorenzo Pieralisi wrote:
> On Fri, Jan 05, 2018 at 02:26:34PM +0000, Bharat Kumar Gogada wrote:
> > On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada
> > wrote:
> > > Bjorn wrote:
> > > > In the PCI config access path, the *_pcie_valid_device()
> > > > functions in 
> > > > the dwc, altera, rockchip, and xilinx drivers all check whether
> > > > the 
> > > > link is up.
> > > > 
> > > > I think this is racy because the link may go down after we
> > > > check but 
> > > > before we perform the config access.
> > > > 
> > > > What would blow up if we removed the *_pcie_link_up() checks?
> > > > 
> > > > I'd like to either remove the checks or add comments about why
> > > > the 
> > > > race is acceptable.  If we've covered this before, I apologize.
> > > > Adding a comment will keep me from pestering you about this
> > > > again in 
> > > > the future.
> > > In both Xilinx driver cases when link is down, hardware responds
> > > by 
> > > AXI DECERR/SLVERR status which causes an exception, synchronous 
> > > external abort to CPU.  This causes system to hang, so we need
> > > this 
> > > check for both of our drivers.  We will add comments.
> > 
> > This is a problem, and checking whether the link is up is a
> > workaround but not a real solution.  That means your system may
> > hang if the link happens to go down at the wrong time.
> > 
> > A real solution would be to handle the synchronous external abort
> > so it doesn't cause a system hang.
> > 
> > Yes, I agree that this is workaround. For pcie-xilinx.c for arm32,
> > we can have fault handling similar to "imx6q_pcie_abort_handler" in
> > drivers/pci/dwc/pci-imx6.c.
> > Since this driver is same for Microblaze architecture also, it
> > requires separate handling.
> > 
> > For pcie-xilinx-nwl.c ARM64 as per link [1], linux kernel will hang
> > for the above AXI responses. 
> > As of now arm64 RAS is still work in progress [2].  
> > 
> > [1] https://www.spinics.net/lists/arm-kernel/msg624203.html
> > 
> > [2] https://patchwork.kernel.org/patch/9973967/
> > 
> > The check can be removed, if above issues were addressed.
> 
> I do not see why the above "issues" should be addressed in order to
> remove that check - as it was pointed out in this thread it just does
> not solve anything, so what's the reason for keeping it ?

It solves the issue that you hang the system on PCIe enumeration in 100%
of the cases when the link is down and you don't have the abort handler
in place.

It doesn't solve the race issue, but that is a lot less likely to be
hit in the real world. I guess it's not a good idea to remove something
that covers 98% of the problem just because it doesn't cover the
remaining 2%, right?

Regards,
Lucas
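
For readers following the thread, here is a minimal user-space sketch of
the pattern under discussion. It is only loosely modeled on the
*_pcie_valid_device() helpers mentioned above and is not code from any of
the drivers: struct mock_pcie, the mock_* function names, and the
placeholder register value are all hypothetical. It shows what the check
buys (no access is issued to a downstream bus while the link is down) and
where the race Bjorn describes remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a host bridge; not a real kernel structure. */
struct mock_pcie {
	bool link_up;        /* simulated link state */
	uint8_t root_busnr;  /* bus number of the root port */
};

/* Stands in for a *_pcie_link_up() helper, which in a real driver
 * would read a controller status register. */
static bool mock_pcie_link_up(const struct mock_pcie *p)
{
	return p->link_up;
}

/* Stands in for a *_pcie_valid_device() helper. */
static bool mock_pcie_valid_device(const struct mock_pcie *p,
				   uint8_t busnr, uint8_t devnr)
{
	/* Downstream buses are only reachable while the link is up. */
	if (busnr != p->root_busnr && !mock_pcie_link_up(p))
		return false;

	/* Assumption: only device 0 (the root port) sits on the root bus. */
	if (busnr == p->root_busnr && devnr > 0)
		return false;

	return true;
}

/* Simulated config read: all-ones means "no device answered". */
static uint32_t mock_pcie_config_read(const struct mock_pcie *p,
				      uint8_t busnr, uint8_t devnr)
{
	if (!mock_pcie_valid_device(p, busnr, devnr))
		return 0xFFFFFFFFu;

	/*
	 * Race window: on real hardware the link could drop right here,
	 * after the check and before the access. The check cannot close
	 * this window; only handling the abort can.
	 */
	return 0x12345678u; /* placeholder value, not a real device ID */
}

/* Small walk-through of the link-down and link-up cases. */
void mock_demo(void)
{
	struct mock_pcie p = { .link_up = false, .root_busnr = 0 };

	/* Link down: the downstream access is skipped entirely. */
	assert(mock_pcie_config_read(&p, 1, 0) == 0xFFFFFFFFu);

	/* Link up: the access goes through. */
	p.link_up = true;
	assert(mock_pcie_config_read(&p, 1, 0) == 0x12345678u);
}
```

In this model the link-down case during enumeration never reaches the
access at all, which is the common case the check covers; the comment in
mock_pcie_config_read() marks the window that it leaves open.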