[REGRESSION] Keystone PCI driver probing and SerDes PLL timeout
Kishon Vijay Abraham I
kvijayab at amd.com
Sun Jan 21 21:52:20 PST 2024
+Siddharth
Hi Diogo,
On 1/12/2024 5:16 PM, Diogo Ivo wrote:
>
> On 1/12/24 07:57, Greg KH wrote:
>> On Thu, Jan 11, 2024 at 02:13:30PM +0000, Diogo Ivo wrote:
>>> Hello,
>>>
>>> When testing the IOT2050 Advanced M.2 platform with Linux CIP 6.1
>>> we came across a breakage in the probing of the Keystone PCI driver
>>> (drivers/phy/ti/pci-keystone.c). This probing was working correctly
>>> in the previous version we were using, v5.10.
>>>
>>> In order to debug this we changed over to mainline Linux and bissecting
>>> lead us to find that commit e611f8cd8717 is the culprit, and with it
>>> applied
>>> we get the following messages:
>>>
>>> [ 10.954597] phy-am654 910000.serdes: Failed to enable PLL
>>> [ 10.960153] phy phy-910000.serdes.3: phy poweron failed --> -110
>>> [ 10.967485] keystone-pcie 5500000.pcie: failed to enable phy
>>> [ 10.973560] keystone-pcie: probe of 5500000.pcie failed with error
>>> -110
>>>
>>> This timeout is occuring in serdes_am654_enable_pll(), called from the
>>> phy_ops .power_on() hook.
>>>
>>> Due to the nature of the error messages and the contents of the
>>> commit we
>>> believe that this is due to an unidentified race condition in the
>>> probing of
>>> the Keystone PCI driver when enabling the PHY PLLs, since changes in the
>>> workqueue the deferred probing runs on should not affect if probing
>>> works
>>> or not. To further support the existence of a race condition, commit
>>> 86bfbb7ce4f6 (a scheduler commit) fixes probing, most likely
>>> unintentionally
>>> meaning that the problem may arise in the future again.
>>>
>>> One possible explanation is that there are pre-requisites for
>>> enabling the PLL
>>> that are not being met when e611f8cd8717 is applied; to see if this
>>> is the case
>>> help from people more familiar with the hardware details would be
>>> useful.
>>>
>>> As official support specifically for the IOT2050 Advanced M.2
>>> platform was
>>> introduced in Linux v6.3 (so in the middle of the commits mentioned
>>> above)
>>> all of our testing was done with the latest mainline DeviceTree with [1]
>>> applied on top.
>>>
>>> This is being reported as a regression even though technically things
>>> are
>>> working with the current state of mainline since we believe the
>>> current fix
>>> to be an unintended by-product of other work.
>>>
>>> #regzbot introduced: e611f8cd8717
>> A "regression" for a commit that was in 5.13, i.e. almost 2 years ago,
>> is a bit tough, and not something I would consider really a "regression"
>> as it is core code that everyone runs. Given you point at scheduler
>> changes also fixing the issue, this seems like a hint as to what is
>> wrong with your driver/platform, but is not the root cause of it and
>> needs to be resolved. Please look at fixing it in your drivers? Are
>> they all in Linus's tree?
>>
>> thanks,
>>
>> greg k-h
> Hello,
>
> I see the point that this code has been living in the kernel for a
> long time now and that it becomes more difficult to justify it as
> a regression; I reported it as such based on the supposition that
> the current fix is not the proper one and that technically this
> support was broken between the identified commits.
>
> If this situation is incompatible with a regression report then it
> can be dropped as one and we keep it is as a bug report for which
> we are looking for input from the community.
>
> I agree that this needs to be fixed in the driver since all other
> drivers are working fine with e611f8cd8717, and yes, all of the
> drivers in question are in mainline, where we performed the bissection.
Looks like Siddharth from TI fixed a similar issue reported by you here.
https://lore.kernel.org/r/20230927041845.1222080-1-s-vadapalli@ti.com
Thanks,
Kishon
More information about the linux-phy
mailing list