PCI trouble on mvebu (Turris Omnia)

Bjorn Helgaas helgaas at kernel.org
Thu Oct 29 15:30:22 EDT 2020


On Thu, Oct 29, 2020 at 12:12:21PM +0100, Toke Høiland-Jørgensen wrote:
> Pali Rohár <pali at kernel.org> writes:

> > I have been testing mainline kernel on Turris Omnia with the two
> > default PCIe cards (WLE200 and WLE900) and it worked fine. But I do not know
> > if I had ASPM enabled or not.
> >
> > So it is working fine for you when CONFIG_PCIEASPM is disabled and whole
> > issue is only when CONFIG_PCIEASPM is enabled?
> 
> Yup, exactly. And I'm also currently testing with the default WLE200/900
> cards... I just tried sticking an MT76-based WiFi card into the third
> PCI slot, and that doesn't come up either when I enable PCIEASPM.

Huh.  So IIUC, the following cases all try to retrain the link and it
fails to come up again:

  - aardvark + WLE900VX (see commit 43fc679ced18)
  - mvebu + WLE200
  - mvebu + WLE900
  - mvebu + MT76

In all these cases, Linux was able to enumerate the NIC, which means
the link was up when firmware handed it off.

I think Linux decided the Common Clock Configuration was wrong, so it
tried to fix it and retrain the link, and the link didn't come back
up.

I don't have "lspci -vv" output from all of them, but in vtolkm's
case, the firmware handed off with:

  00:02.0 Root Port to [bus 02]  SlotClk+ CommClk+
  02:00.0 QCA986x/988x NIC       SlotClk+ CommClk-

Per spec (PCIe r5, sec 7.5.3.7), SlotClk is HwInit and CommClk is RW
and should power up as 0.  If I'm reading the implementation note
correctly, if SlotClk is set on both ends of the link, software should
set CommClk, so the config above *does* look wrong, and CommClk+ on
the Root Port suggests that firmware set it.
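
For anybody who only has raw setpci output rather than "lspci -vv", here
is a rough bash sketch of the check described above.  The bit positions
(Link Control bit 6 = Common Clock Configuration, Link Status bit 12 =
Slot Clock Configuration) are the standard register layout; the function
names and the sample register values in the usage note are made up for
illustration and are not taken from anybody's system.

```shell
#!/usr/bin/env bash
# Sketch only: decode SlotClk/CommClk from raw register words as printed
# by setpci (bare hex, no 0x prefix).  Helper names are hypothetical.
LNKCTL_CCC=0x0040   # Link Control: Common Clock Configuration
LNKSTA_SLC=0x1000   # Link Status: Slot Clock Configuration

# flag VALUE MASK -> prints "+" if any masked bit is set, else "-"
flag() {
    if (( ($1 & $2) != 0 )); then printf '+'; else printf '%s' '-'; fi
}

# clk_summary LNKCTL LNKSTA -> e.g. "SlotClk+ CommClk-"
clk_summary() {
    local lnkctl=$(( 0x$1 )) lnksta=$(( 0x$2 ))
    printf 'SlotClk%s CommClk%s\n' \
        "$(flag "$lnksta" "$LNKSTA_SLC")" "$(flag "$lnkctl" "$LNKCTL_CCC")"
}

# commclk_consistent ROOT_LNKCTL ROOT_LNKSTA NIC_LNKCTL NIC_LNKSTA
# Succeeds when CommClk on both ends matches what the SlotClk bits imply:
# set on both ends if SlotClk is set on both ends, clear otherwise.
commclk_consistent() {
    local rctl=$(( 0x$1 )) rsta=$(( 0x$2 )) nctl=$(( 0x$3 )) nsta=$(( 0x$4 ))
    local want=0
    if (( (rsta & LNKSTA_SLC) && (nsta & LNKSTA_SLC) )); then
        want=$(( LNKCTL_CCC ))
    fi
    (( (rctl & LNKCTL_CCC) == want && (nctl & LNKCTL_CCC) == want ))
}
```

With made-up values, "clk_summary 0040 1011" prints "SlotClk+ CommClk+"
and "clk_summary 0000 1011" prints "SlotClk+ CommClk-", and
"commclk_consistent 0040 1011 0000 1011" fails, which is the same shape
as the handoff state quoted above.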

I think both the aardvark and mvebu systems probably use U-Boot.  I
don't know U-Boot at all, but I don't see anything in it that touches
Link Control.  I'm curious what happens if you put one of these cards
in a PC.  If anybody tries it, please collect the "sudo lspci -vv" and
dmesg output.

We could quirk these NICs to avoid the retrain, but since aardvark and
mvebu have no obvious connection and WLE200/WLE900 and MT76 have no
obvious connection, I doubt there's a simple hardware defect that
explains all these.

Maybe we're doing something wrong in the retrain, but obviously the
link came up in the first place.  AFAIK the only thing we're changing
is the CommClk setting, and that looks legitimate per spec.

Another experiment: build kernel without CONFIG_PCIEASPM, set $ROOT
and $NIC appropriately, and try the following:

  # Set $ROOT and $NIC (update to match your system):

    # ROOT=00:02.0
    # NIC=02:00.0

  # Dump the Root Port and NIC Link registers:

    # setpci -s$ROOT CAP_EXP+0xc.l              # Link Capabilities
    # setpci -s$ROOT CAP_EXP+0x10.w             # Link Control
    # setpci -s$ROOT CAP_EXP+0x12.w             # Link Status

    # setpci -s$NIC  CAP_EXP+0xc.l              # Link Capabilities
    # setpci -s$NIC  CAP_EXP+0x10.w             # Link Control
    # setpci -s$NIC  CAP_EXP+0x12.w             # Link Status

  # Retrain the link:

    # setpci -s$ROOT CAP_EXP+0x10.w=0x0020      # Link Control Retrain Link
    # sleep 1
    # setpci -s$ROOT CAP_EXP+0x12.w             # Link Status
    # setpci -s$NIC  CAP_EXP+0x12.w             # Link Status

  # Set CommClk+ and retrain the link:

    # setpci -s$NIC  CAP_EXP+0x10.w=0x0040      # Link Control Common Clock
    # setpci -s$ROOT CAP_EXP+0x10.w=0x0040      # Link Control Common Clock
    # setpci -s$ROOT CAP_EXP+0x10.w=0x0060      # Link Control RL + CC
    # sleep 1
    # setpci -s$ROOT CAP_EXP+0x12.w             # Link Status
    # setpci -s$NIC  CAP_EXP+0x12.w             # Link Status
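
One caveat about the plain writes above: a word write replaces the whole
Link Control register, so writing 0x0020 to a Root Port that currently
has CommClk+ also clears CommClk (and any ASPM bits) as a side effect.
If you want a retrain that leaves the other bits alone, setpci(8)
documents a bits:mask form for values.  The sketch below only prints
such a command instead of running it, and adds a small decoder for the
Link Status words that come back.  The function names are hypothetical,
and the Data Link Layer Link Active bit is only meaningful if the port
advertises that reporting capability in Link Capabilities.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for reading the results of the experiment.
# Input is a Link Status word as printed by setpci (bare hex, no 0x).

# lnksta_summary LNKSTA -> "speed=N width=xN training=0|1 dllla=0|1"
#   speed    = Current Link Speed encoding (bits 3:0; 1 means 2.5 GT/s)
#   width    = Negotiated Link Width (bits 9:4)
#   training = Link Training in progress (bit 11)
#   dllla    = Data Link Layer Link Active (bit 13)
lnksta_summary() {
    local v=$(( 0x$1 ))
    printf 'speed=%d width=x%d training=%d dllla=%d\n' \
        $(( v & 0xf )) $(( (v >> 4) & 0x3f )) \
        $(( (v >> 11) & 1 )) $(( (v >> 13) & 1 ))
}

# masked_retrain_cmd BDF -> prints a setpci command that sets only the
# Retrain Link bit (0x0020) and leaves the other Link Control bits as is.
masked_retrain_cmd() {
    printf 'setpci -s %s CAP_EXP+0x10.w=0x0020:0x0020\n' "$1"
}
```

For example, a made-up Link Status of 3011 decodes to "speed=1 width=x1
training=0 dllla=1".  A value with training=1 a full second after the
retrain would suggest the link is stuck in training.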


