Why do we check for "link-up" in *_pcie_valid_device()?

Lorenzo Pieralisi lorenzo.pieralisi at arm.com
Fri Jan 5 07:43:03 PST 2018


On Fri, Jan 05, 2018 at 02:26:34PM +0000, Bharat Kumar Gogada wrote:
> On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada wrote:
> > Bjorn wrote:
> >> In the PCI config access path, the *_pcie_valid_device() functions in 
> >> the dwc, altera, rockchip, and xilinx drivers all check whether the 
> >> link is up.
> >> 
> >> I think this is racy because the link may go down after we check but 
> >> before we perform the config access.
> >> 
> >> What would blow up if we removed the *_pcie_link_up() checks?
> >> 
> >> I'd like to either remove the checks or add comments about why the 
> >> race is acceptable.  If we've covered this before, I apologize.
> >> Adding a comment will keep me from pestering you about this again in 
> >> the future.
> 
> > In both Xilinx drivers, when the link is down the hardware responds
> > with an AXI DECERR/SLVERR status, which raises a synchronous external
> > abort on the CPU.  This hangs the system, so we need this check in
> > both of our drivers.  We will add comments.
> 
> This is a problem, and checking whether the link is up is a workaround
> but not a real solution.  That means your system may hang if the link
> happens to go down at the wrong time.
> 
> A real solution would be to handle the synchronous external abort so it doesn't cause a system hang.
> 
> Yes, I agree that this is a workaround.  For pcie-xilinx.c on arm32, we
> can add fault handling similar to "imx6q_pcie_abort_handler" in
> drivers/pci/dwc/pci-imx6.c.  Since the same driver is also used on the
> Microblaze architecture, that architecture requires separate handling.
> 
> For pcie-xilinx-nwl.c on ARM64, as per link [1], the Linux kernel will
> hang on the above AXI responses.  As of now, arm64 RAS is still a work
> in progress [2].
> 
> [1] https://www.spinics.net/lists/arm-kernel/msg624203.html
> 
> [2] https://patchwork.kernel.org/patch/9973967/
> 
> The check can be removed once the above issues are addressed.

I do not see why the above "issues" should be addressed in order to
remove that check.  As was pointed out in this thread, it just does
not solve anything, so what's the reason for keeping it?
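
The race described at the top of the thread can be made concrete with
a small userspace model.  This is only a sketch under stated
assumptions, not driver code: struct fake_pcie, fake_cfg_read(), the
"between" hook and the returned values are all hypothetical, and the
"aborted" flag merely stands in for the synchronous external abort the
real hardware would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a host bridge; not taken from any real driver. */
struct fake_pcie {
	bool link_up;
	bool aborted;	/* stands in for the synchronous external abort */
};

static bool fake_pcie_link_up(const struct fake_pcie *p)
{
	return p->link_up;
}

/* Models the hardware: a config access with the link down "aborts". */
static uint32_t fake_hw_cfg_read(struct fake_pcie *p)
{
	if (!p->link_up) {
		p->aborted = true;
		return 0xffffffffu;
	}
	return 0x12345678u;	/* arbitrary made-up device data */
}

/*
 * Check-then-access, as in the *_pcie_valid_device() pattern.  The
 * optional "between" hook runs in the window after the check and
 * before the access, modelling an asynchronous link-down event.
 */
static uint32_t fake_cfg_read(struct fake_pcie *p,
			      void (*between)(struct fake_pcie *))
{
	assert(p != NULL);

	if (!fake_pcie_link_up(p))
		return 0xffffffffu;	/* check caught the down link */

	if (between)
		between(p);		/* link state may change here */

	return fake_hw_cfg_read(p);
}

/* Hook that takes the link down inside the race window. */
static void link_drops(struct fake_pcie *p)
{
	p->link_up = false;
}
```

With the link up and no hook, fake_cfg_read() returns the device data;
with the link already down, the check catches it.  When link_drops()
is passed as the hook, the check passes and the access still "aborts",
which is the window the link-up test cannot close.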

Lorenzo


