CONFIG_PCIEASPM breaks PCIe on Marvell Armada 385 machine

David Daney ddaney at caviumnetworks.com
Tue Jan 17 15:37:10 PST 2017


On 01/17/2017 02:22 PM, Bjorn Helgaas wrote:
> [+cc David]
>
> On Tue, Jan 17, 2017 at 09:02:58PM +0000, Russell King - ARM Linux wrote:
>> On Tue, Jan 17, 2017 at 07:34:14PM +0000, Russell King - ARM Linux wrote:
>>> Uwe, can you try:
>>>
>>> setpci -s <whatever-the-id-of-the-root-is-it's-blanked-out-in-the-above> \
>>> 	0x50.w=0x60
>>>
>>> and see whether it remains alive (you can check by reading the root
>>> register 0x52.w - bit 12 should be set once bit 11 clears again).
>>
>> For reference, this I got wrong...
>>
>> 0xf1041a04 bit 0 indicates link status (0 = link up, 1 = link down).
>>
>>> If that's successful, maybe setting the common clock bit on the PCIe
>>> device is what's causing the problem, in which case:
>>>
>>> setpci -s 02:00.0 0x80.w=0x40
>>> setpci -s <whatever-the-id-of-the-root-is-it's-blanked-out-in-the-above> \
>>> 	0x50.w=0x60
>>
>> Having worked with Uwe over IRC, it seems that any request to retrain
>> causes the link to go down, either with or without the common clock bit
>> set:
>>
>> # setpci -s 2.0 0x50.w=0x60
>> # setpci -s 2.0 0x52.w
>> 0011
>> # memtool md 0xf1041a04+4
>> f1041a04: 00010201
>> ... reboot ...
>> # setpci -s 2.0 0x50.w=0x20
>> # memtool md 0xf1041a04+4
>> f1041a04: 00010201
>>
>> which doesn't point towards ASPM itself, but the problem is caused by
>> a side effect of ASPM's setup code which always triggers a retrain.
>>
>> Bit 5 in that register is documented (at least in the Armada 370 docs
>> and Armada XP docs I have) as:
>>
>> 5  RetrnLnk  RW    Retrain Link
>>              0x0   This bit forces the device to initiate link retraining.
>>                    Always returns 0 when read.
>>                    NOTE: If configured as an Endpoint, this field is
>>                    reserved and has no effect.
>>
>> Bjorn, are you aware of similar situations where a request for the PCIe
>> link to be retrained causes it to fail?
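
For anyone decoding the setpci values quoted above, here is a minimal sketch in C. It assumes the PCIe capability on the root port starts at 0x40, so that 0x50 is Link Control and 0x52 is Link Status, which matches the bit 5 description quoted above. The masks follow the standard Link Control/Link Status layout (the same values as the PCI_EXP_LNKCTL_* and PCI_EXP_LNKSTA_* definitions in the Linux headers); the struct and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Control register bits */
#define LNKCTL_RETRAIN_LINK   0x0020u  /* bit 5 */
#define LNKCTL_COMMON_CLOCK   0x0040u  /* bit 6 */

/* Link Status register fields */
#define LNKSTA_SPEED_MASK     0x000fu  /* bits 3:0, speed code */
#define LNKSTA_WIDTH_MASK     0x03f0u  /* bits 9:4, negotiated width */
#define LNKSTA_WIDTH_SHIFT    4
#define LNKSTA_LINK_TRAINING  0x0800u  /* bit 11 */
#define LNKSTA_SLOT_CLOCK     0x1000u  /* bit 12 */

/* Hypothetical container for the decoded Link Status fields. */
struct link_status {
    unsigned speed_code;   /* 1 = 2.5 GT/s */
    unsigned width;        /* number of lanes */
    bool training;         /* link training in progress */
    bool slot_clock;       /* slot clock configuration */
};

static struct link_status decode_lnksta(uint16_t v)
{
    struct link_status s;

    s.speed_code = v & LNKSTA_SPEED_MASK;
    s.width = (v & LNKSTA_WIDTH_MASK) >> LNKSTA_WIDTH_SHIFT;
    s.training = (v & LNKSTA_LINK_TRAINING) != 0;
    s.slot_clock = (v & LNKSTA_SLOT_CLOCK) != 0;
    return s;
}

/* True if a Link Control write with this value asks for a retrain. */
static bool requests_retrain(uint16_t lnkctl)
{
    return (lnkctl & LNKCTL_RETRAIN_LINK) != 0;
}

/* True if a Link Control write with this value sets common clock. */
static bool sets_common_clock(uint16_t lnkctl)
{
    return (lnkctl & LNKCTL_COMMON_CLOCK) != 0;
}
```

Under those assumptions, 0x60 is retrain plus common clock, 0x20 is retrain alone, and the 0011 read back from 0x52.w is speed code 1, width 1, with bits 11 and 12 both clear.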


Link (re)training can fail for several reasons including, but not 
limited to:

- Poor signal propagation through the chips/packages/boards/connectors,
also known as Signal Integrity (SI) problems.

- Incorrect implementation, in hardware, of link training protocols at
either end of the link.

Usually, system and PCIe device vendors do a lot of testing and signal 
analysis across a variety of configurations with the end goal being that 
PCIe looks like a bullet-proof interconnect to the end consumer.

Unfortunately, sometimes it doesn't work.  In these cases, the vendors of
the devices on each end of the link tend to point fingers at the link
partner for being defective in some way.

This patch:

>
> The only one that comes to mind is this patch from David (CC'd) that
> avoids ASPM-related retrains when we know the link doesn't support ASPM:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e53f9a28bee3
>

Is an attempt to work around the problem from the system (host) end.  If 
the system vendor knows a priori that a defective PCIe device is present 
in the system, the PCIe root port can be configured to indicate no ASPM 
is supported, resulting (with the patch) in no link retraining being 
attempted.
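
The idea can be sketched roughly as follows. This is not the code from the commit, only an illustration of the check, assuming the decision is made from the ASPM Support field (bits 11:10) of the Link Capabilities register at each end of the link. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASPM Support field of Link Capabilities, bits 11:10
 * (PCI_EXP_LNKCAP_ASPMS in the Linux headers). */
#define LNKCAP_ASPM_SUPPORT 0x00000c00u

/*
 * Hypothetical helper: returns true only when both ends of the link
 * advertise at least one ASPM state in common.  If it returns false,
 * ASPM setup (and the retrain it triggers) would be skipped.
 */
static bool aspm_setup_may_retrain(uint32_t root_lnkcap, uint32_t dev_lnkcap)
{
    return (root_lnkcap & dev_lnkcap & LNKCAP_ASPM_SUPPORT) != 0;
}
```

With a root port configured to report no ASPM support (field reads as 0), the helper returns false regardless of what the device advertises.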

To me it feels that we need a blacklist of devices that frequently fail
link retraining, so that when one is encountered, ASPM is disabled on
the link where it resides.
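
Something along these lines, as a rough sketch only. The table layout and names are hypothetical, and the vendor/device IDs are placeholders, not real devices known to fail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry identifying a device by vendor/device ID. */
struct retrain_blacklist_entry {
    uint16_t vendor;
    uint16_t device;
};

/* Placeholder IDs for illustration only. */
static const struct retrain_blacklist_entry retrain_blacklist[] = {
    { 0xffff, 0x0001 },
    { 0xffff, 0x0002 },
};

/*
 * Returns true if the device is listed, in which case ASPM (and the
 * retrain its setup triggers) would be left disabled on that link.
 */
static bool link_retrain_blacklisted(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(retrain_blacklist) / sizeof(retrain_blacklist[0]);

    for (size_t i = 0; i < n; i++) {
        if (retrain_blacklist[i].vendor == vendor &&
            retrain_blacklist[i].device == device)
            return true;
    }
    return false;
}
```

The lookup would run once per link, against the device at the downstream end, before any ASPM configuration is attempted.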

Just my $0.02
David Daney


> Side note: it looks like we don't use the recommended retrain
> algorithm in the implementation note about avoiding race conditions in
> PCIe r3.0, sec 7.8.7.
>



