[REGRESSION] Keystone PCI driver probing and SerDes PLL timeout

Greg KH gregkh at linuxfoundation.org
Thu Jan 11 23:57:59 PST 2024


On Thu, Jan 11, 2024 at 02:13:30PM +0000, Diogo Ivo wrote:
> Hello,
> 
> When testing the IOT2050 Advanced M.2 platform with Linux CIP 6.1
> we came across a breakage in the probing of the Keystone PCI driver
> (drivers/pci/controller/dwc/pci-keystone.c). Probing was working
> correctly in the previous version we were using, v5.10.
> 
> In order to debug this we changed over to mainline Linux, and bisecting
> led us to find that commit e611f8cd8717 is the culprit; with it applied
> we get the following messages:
> 
> [   10.954597] phy-am654 910000.serdes: Failed to enable PLL
> [   10.960153] phy phy-910000.serdes.3: phy poweron failed --> -110
> [   10.967485] keystone-pcie 5500000.pcie: failed to enable phy
> [   10.973560] keystone-pcie: probe of 5500000.pcie failed with error -110
> 
> This timeout is occurring in serdes_am654_enable_pll(), called from the
> phy_ops .power_on() hook.
> 
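> To make it clearer where the -110 (-ETIMEDOUT) comes from, the enable
> path has roughly the following shape. This is only a simplified sketch
> with illustrative struct fields and constants, not the exact code from
> drivers/phy/ti/phy-am654-serdes.c:
> 
>   #include <linux/device.h>
>   #include <linux/phy/phy.h>
>   #include <linux/regmap.h>
> 
>   #define PLL_LOCK_TIMEOUT_US  100000  /* illustrative value */
> 
>   /* illustrative per-PHY state */
>   struct serdes_am654 {
>           struct device *dev;
>           struct regmap_field *pll_enable;
>           struct regmap_field *pll_ok;
>   };
> 
>   /* Request PLL enable and poll the lock bit; the poll helper returns
>    * -ETIMEDOUT (-110) if the PLL never reports lock in time. */
>   static int serdes_am654_enable_pll(struct serdes_am654 *phy)
>   {
>           u32 val;
>           int ret;
> 
>           ret = regmap_field_write(phy->pll_enable, 1);
>           if (ret)
>                   return ret;
> 
>           return regmap_field_read_poll_timeout(phy->pll_ok, val, val,
>                                                 1000, PLL_LOCK_TIMEOUT_US);
>   }
> 
>   /* The error propagates through the phy_ops .power_on() hook and
>    * phy_power_on() up to the keystone-pcie probe, which then fails
>    * with -110 as seen in the log above. */
>   static int serdes_am654_power_on(struct phy *x)
>   {
>           struct serdes_am654 *phy = phy_get_drvdata(x);
>           int ret;
> 
>           ret = serdes_am654_enable_pll(phy);
>           if (ret) {
>                   dev_err(phy->dev, "Failed to enable PLL\n");
>                   return ret;
>           }
> 
>           return 0;
>   }
> 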
> Given the nature of the error messages and the contents of the commit, we
> believe this is caused by an unidentified race condition in the probing of
> the Keystone PCI driver when enabling the PHY PLLs, since a change in which
> workqueue the deferred probing runs on should not affect whether probing
> succeeds. Further supporting the race-condition theory, commit 86bfbb7ce4f6
> (a scheduler commit) fixes probing, most likely unintentionally, meaning
> the problem may resurface in the future.
> 
> One possible explanation is that there are prerequisites for enabling the
> PLL that are not met when e611f8cd8717 is applied; help from people more
> familiar with the hardware details would be useful to confirm whether this
> is the case.
> 
> As official support specifically for the IOT2050 Advanced M.2 platform was
> only introduced in Linux v6.3 (that is, in between the commits mentioned
> above), all of our testing was done with the latest mainline DeviceTree
> with [1] applied on top.
> 
> This is being reported as a regression even though, technically, things are
> working with the current state of mainline, since we believe the current fix
> is an unintended by-product of other work.
> 
> #regzbot introduced: e611f8cd8717

A "regression" for a commit that was in 5.13, i.e. almost 2 years ago,
is a bit tough, and not something I would consider really a "regression"
as it is core code that everyone runs.  Given you point at scheduler
changes also fixing the issue, this seems like a hint as to what is
wrong with your driver/platform, but is not the root cause of it and
needs to be resolved.  Please look at fixing it in your drivers?  Are
they all in Linus's tree?

thanks,

greg k-h


