CONFIG_PCIEASPM breaks PCIe on Marvell Armada 385 machine

Wed Jan 18 09:36:55 PST 2017

On 01/18/2017 06:22 AM, Bjorn Helgaas wrote:
> On Tue, Jan 17, 2017 at 03:37:10PM -0800, David Daney wrote:
[...]
>>
>>
>> Link (re)training can fail for several reasons including, but not
>> limited to:
>>
>> - Poor signal propagation through the
>> chips/packages/boards/connectors, also known as Signal Integrity
>> (SI) problems.
>>
>> - Incorrect implementation, in hardware, of link training protocols
>> at either end of the link.
>>
>> Usually, system and PCIe device vendors do a lot of testing and
>> signal analysis across a variety of configurations with the end goal
>> being that PCIe looks like a bullet-proof interconnect to the end
>> consumer.
>>
>> Unfortunately, sometimes it doesn't work.  In these cases, the
>> vendors of the devices on each end of the link tend to point fingers
>> at the link partner for being defective in some way.
>>
>> This patch:
>>
>>>
>>> The only one that comes to mind is this patch from David (CC'd) that
>>> avoids ASPM-related retrains when we know the link doesn't support ASPM:
>>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e53f9a28bee3
>>>
>>
>> Is an attempt to work around the problem from the system (host) end.
>> If the system vendor knows a priori that a defective PCIe device is
>> present in the system, the PCIe root port can be configured to
>> indicate no ASPM is supported, resulting (with the patch) in no link
>> retraining being attempted.
>>
>> To me it feels that we need a black list of devices that fail link
>> retraining at a high rate; encountering one of them would disable
>> ASPM on the link where it resides.
>
> I should have asked you for details about the defective devices
> related to e53f9a28bee3 :)  If we had included that in the changelog,
> we would have something to seed a blacklist with.

The device I saw failing I don't have access to any more, so I don't
know the PCI IDs.  It was a solid-state storage device with a Xilinx
FPGA acting as the PCIe endpoint.  In any event, it would only fail in
about 0.5% of system boots; it could not be made to fail reliably.
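
To put the 0.5% figure in perspective, here is a rough sketch of how
many boots it takes before such a failure is likely to show up at all.
It assumes each boot fails independently with the same probability,
which is a simplification and not something measured on the system in
question; the function names are made up for illustration.

```c
/*
 * Probability of seeing at least one failure in n boots, assuming each
 * boot fails independently with probability p (an assumption).
 */
double prob_at_least_one_failure(double p, unsigned int n)
{
    double all_pass = 1.0;

    for (unsigned int i = 0; i < n; i++)
        all_pass *= 1.0 - p;    /* chance that every boot so far passed */

    return 1.0 - all_pass;
}

/*
 * Smallest number of boots for which the probability of at least one
 * failure reaches `confidence`.  Returns 0 for inputs that could never
 * reach it (p <= 0 or confidence >= 1), to avoid looping forever.
 */
unsigned int boots_for_confidence(double p, double confidence)
{
    double all_pass = 1.0;
    unsigned int n = 0;

    if (p <= 0.0 || confidence >= 1.0)
        return 0;

    while (1.0 - all_pass < confidence) {
        all_pass *= 1.0 - p;
        n++;
    }
    return n;
}
```

With p = 0.005, a run of 100 boots shows at least one failure only
about 39% of the time, and it takes roughly 600 boots to get to 95%.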

The tricky thing here is assigning the blame for failure in link 
training.  In the case in question we spent many months analysing the 
analog properties of the bus and examining/decoding analog scope
captures of the failures before credibly assigning blame to the other 
guy.  Usually what happens is the device vendor accurately claims that 
their device works flawlessly in conjunction with certain Intel root 
ports, so the problem must be fixed in the root port of the failing 
system.  If you have a black list, you may be disabling ASPM in systems 
where it can work without failures.
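
A minimal userspace sketch of what such a black list lookup could look
like is below.  This is not kernel code and not the approach taken in
e53f9a28bee3; the struct, the function name and the IDs are all
hypothetical placeholders, since I don't have real IDs for any failing
device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: an endpoint identified by PCI vendor/device ID. */
struct aspm_blacklist_entry {
    uint16_t vendor;
    uint16_t device;
};

/* Placeholder IDs only; no real device is being named here. */
static const struct aspm_blacklist_entry aspm_blacklist[] = {
    { 0xffff, 0x0001 },
    { 0xffff, 0x0002 },
};

/* Returns true if ASPM would be left disabled on this endpoint's link. */
bool aspm_blacklisted(uint16_t vendor, uint16_t device)
{
    size_t count = sizeof(aspm_blacklist) / sizeof(aspm_blacklist[0]);

    for (size_t i = 0; i < count; i++) {
        if (aspm_blacklist[i].vendor == vendor &&
            aspm_blacklist[i].device == device)
            return true;
    }
    return false;
}
```

Note that the match is on the endpoint alone, so it fires no matter
which root port the device sits behind.  That is exactly the concern:
the same entry would also turn ASPM off in systems where the device
retrains without trouble.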

>
> There are several situations other than ASPM where link retraining is
> required per spec (rate change, error handling, etc), and I guess we'd
> have to avoid all of them.  So I suppose e53f9a28bee3 avoids the most
> obvious failures, but maybe we could still see issues in those other
> cases.
>
> Bjorn