[PATCH v2 0/9] PCI: rockchip: Fix RK3399 PCIe endpoint controller driver

Tue Mar 14 17:00:40 PDT 2023

On 3/15/23 07:54, Damien Le Moal wrote:
> On 3/14/23 23:53, Rick Wertenbroek wrote:
>> Hello Damien,
>> I also noticed random issues I suspect to be related to link status or power
>> state, in my case it sometimes happens that the BARs (0-6) in the config
>> space get reset to 0. This is not due to the driver because the driver never
>> ever accesses these registers (@0xfd80'0010 to 0xfd80'0024 TRM
>> 17.6.4.1.5-17.6.4.1.10).
>> I don't think the host rewrites them because lspci shows the BARs as
>> "[virtual]" which means they have been assigned by host but have 0
>> value in the endpoint device (when lspci rereads the PCI config header).
>> See https://github.com/pciutils/pciutils/blob/master/lspci.c#L422
>>
>> So I suspect the controller detects something related to link status or
>> power state and internally (in hardware) resets those registers. It's not
>> the kernel code, it never accesses these regs. The problem occurs
>> very randomly, sometimes in a few seconds, sometimes I cannot see
>> it for a whole day.
>>
>> Is this similar to what you are experiencing ?
> 
> Yes. I sometimes get NMIs after starting the function driver, when my function
> driver starts probing the bar registers after seeing the host changing one
> register. And the link also comes up with 4 lanes or 2 lanes, random.
> 
>> Do you have any idea as to what could make these registers to be reset
>> (I could not find anything in the TRM, also nothing in the driver seems to
>> cause it).
> 
> My thinking is that since we do not have a linkup notifier, the function driver
> starts setting things up without the link established (e.g. when the host is
> still powered down). Once the host start booting and pic link is established,
> things may be reset in the hardware... That is the only thing I can think of.
> 
> And yes, there are definitely something going on with the power states too I
> think: if I let things idle for a few minutes, everything stops working: no
> activity seen on the endpoint over the BARs. I tried enabling the sys and client
> interrupts to see if I can see power state changes, or if clearing the
> interrupts helps (they are masked by default), but no change. And booting the
> host with pci_aspm=off does not help either. Also tried to change all the
> capabilities related to link & power states to "off" (not supported), and no
> change either. So currently, I am out of ideas regarding that one.
> 
> I am trying to make progress on my endpoint driver (nvme function) to be sure it
> is not a bug there that breaks things. I may still have something bad because
> when I enable the BIOS native NVMe driver on the host, either the host does not
> boot, or grub crashes with memory corruptions. Overall, not yet very stable and
> still trying to sort out the root cause of that.

By the way, enabling the interrupts to see the error notifications, I do see a
lot of retry timeout and other recoverable errors. So the issues I am seeing
could be due to my PCI cable setup that is not ideal (bad signal, ground loops,
... ?). Not sure. I do not have a PCI analyzer handy :)

I attached the patches I used to enable the EP interrupts. Enabling debug prints
will tell you what is going on. That may give you some hints on your setup ?

-- 
Damien Le Moal
Western Digital Research
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-arm-rk3399-rockpro64-Add-interrupts-to-PCI-endpoint-.patch
Type: text/x-patch
Size: 1142 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-rockchip/attachments/20230315/ba38679e/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-PCI-rockchip-Add-client-and-subsys-interrupts-for-en.patch
Type: text/x-patch
Size: 6301 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-rockchip/attachments/20230315/ba38679e/attachment-0001.bin>