[REGRESSION] Keystone PCI driver probing and SerDes PLL timeout
Siddharth Vadapalli
s-vadapalli at ti.com
Sun Jan 21 22:13:47 PST 2024
On 22/01/24 11:22, Kishon Vijay Abraham I wrote:
> +Siddharth
>
> Hi Diogo,
>
> On 1/12/2024 5:16 PM, Diogo Ivo wrote:
>>
>> On 1/12/24 07:57, Greg KH wrote:
>>> On Thu, Jan 11, 2024 at 02:13:30PM +0000, Diogo Ivo wrote:
>>>> Hello,
>>>>
>>>> When testing the IOT2050 Advanced M.2 platform with Linux CIP 6.1
>>>> we came across a breakage in the probing of the Keystone PCI driver
>>>> (drivers/phy/ti/pci-keystone.c). This probing was working correctly
>>>> in the previous version we were using, v5.10.
>>>>
>>>> In order to debug this we changed over to mainline Linux and bissecting
>>>> lead us to find that commit e611f8cd8717 is the culprit, and with it applied
>>>> we get the following messages:
>>>>
>>>> [ 10.954597] phy-am654 910000.serdes: Failed to enable PLL
>>>> [ 10.960153] phy phy-910000.serdes.3: phy poweron failed --> -110
>>>> [ 10.967485] keystone-pcie 5500000.pcie: failed to enable phy
>>>> [ 10.973560] keystone-pcie: probe of 5500000.pcie failed with error -110
>>>>
>>>> This timeout is occuring in serdes_am654_enable_pll(), called from the
>>>> phy_ops .power_on() hook.
>>>>
>>>> Due to the nature of the error messages and the contents of the commit we
>>>> believe that this is due to an unidentified race condition in the probing of
>>>> the Keystone PCI driver when enabling the PHY PLLs, since changes in the
>>>> workqueue the deferred probing runs on should not affect if probing works
>>>> or not. To further support the existence of a race condition, commit
>>>> 86bfbb7ce4f6 (a scheduler commit) fixes probing, most likely unintentionally
>>>> meaning that the problem may arise in the future again.
>>>>
>>>> One possible explanation is that there are pre-requisites for enabling the PLL
>>>> that are not being met when e611f8cd8717 is applied; to see if this is the case
>>>> help from people more familiar with the hardware details would be useful.
>>>>
>>>> As official support specifically for the IOT2050 Advanced M.2 platform was
>>>> introduced in Linux v6.3 (so in the middle of the commits mentioned above)
>>>> all of our testing was done with the latest mainline DeviceTree with [1]
>>>> applied on top.
>>>>
>>>> This is being reported as a regression even though technically things are
>>>> working with the current state of mainline since we believe the current fix
>>>> to be an unintended by-product of other work.
>>>>
>>>> #regzbot introduced: e611f8cd8717
>>> A "regression" for a commit that was in 5.13, i.e. almost 2 years ago,
>>> is a bit tough, and not something I would consider really a "regression"
>>> as it is core code that everyone runs. Given you point at scheduler
>>> changes also fixing the issue, this seems like a hint as to what is
>>> wrong with your driver/platform, but is not the root cause of it and
>>> needs to be resolved. Please look at fixing it in your drivers? Are
>>> they all in Linus's tree?
>>>
>>> thanks,
>>>
>>> greg k-h
>> Hello,
>>
>> I see the point that this code has been living in the kernel for a
>> long time now and that it becomes more difficult to justify it as
>> a regression; I reported it as such based on the supposition that
>> the current fix is not the proper one and that technically this
>> support was broken between the identified commits.
>>
>> If this situation is incompatible with a regression report then it
>> can be dropped as one and we keep it is as a bug report for which
>> we are looking for input from the community.
>>
>> I agree that this needs to be fixed in the driver since all other
>> drivers are working fine with e611f8cd8717, and yes, all of the
>> drivers in question are in mainline, where we performed the bissection.
>
> Looks like Siddharth from TI fixed a similar issue reported by you here.
> https://lore.kernel.org/r/20230927041845.1222080-1-s-vadapalli@ti.com
Kishon,
Thank you for looping me in.
Diogo,
The issue you are referring to is identical to the one fixed in the patch shared
above by Kishon. I had also bisected the culprit to commit e611f8cd8717 which
doesn't have anything to do with PCIe/Serdes in particular. It only seems to
expose the underlying race condition which has always existed. The fix has been
merged and is now a part of mainline Linux:
https://github.com/torvalds/linux/commit/c12ca110c613a81cb0f0099019c839d078cd0f38
Additionally, you might run into an issue once you fix the above which happens
to be a 45 second delay when no Endpoint Device is connected to the PCIe
connector. I have posted a patch for fixing that issue as well:
https://lore.kernel.org/r/20231019081330.2975470-1-s-vadapalli@ti.com/
--
Regards,
Siddharth.
More information about the linux-phy
mailing list