racing ndo_open()/phylink*connect() with phy_probe()

Russell King (Oracle) linux at armlinux.org.uk
Tue Apr 30 14:42:58 PDT 2024


On Tue, Apr 30, 2024 at 04:02:19PM -0500, Andrew Halaney wrote:
> Basically, NetworkManager is setting both interfaces to up, and end1's
> phy doesn't seem to be ready when ndo_open() runs, returning
> -ENODEV in phylink_fwnode_phy_connect() and bubbling that back up. This doesn't

Let's get something clear - you're attributing phylink to this, but this
is not the case. phylink doesn't deal directly with PHYs, it makes use
of phylib for that, and merely passes back to its caller whatever status
it gets from phylib. It's also not fair to attribute this to phylib as
we will see later...

There are a few reasons for phylink_fwnode_phy_connect() would return
-ENODEV:

1) fwnode_get_phy_node() (a phylib function) returning an error,
basically meaning the phy node isn't found. This would be a persistent
error, so unlikely to be your issue.

2) fwnode_phy_find_device() (another phylib function) not finding the
PHY device corresponding to the fwnode returned by the above on the
MDIO bus. This is possible if the PHY has not been detected on the
MDIO bus, but I suspect this is not the cause of your issue.

3) phy_attach_direct() (another phylib function) returning an error.
This function calls phy_init_hw() which will attempt to talk to the
hardware, and if that returns an error, it will be propagated up.

(3) is the most likely scenario given your quoted DT description. I
suspect that the stmmac/qcom-ethqos driver is what's at fault here.

Your DT description shows that the PHYs are on one MDIO bus owned by
one of the network interfaces. I suspect if that network interface
is down, then the MDIO bus is not accessible.

Therefore, when you attempt to open the _other_ network interface,
accesses to its PHY fail with -ENODEV and that gets propagated all
the way back up.

What's more is if you did manage to get that network interface up
(because the one with the MDIO bus on was up) then if you take
that interface down, you'll end up with a phy_error() splat from
phylib because the PHY it was using has become inaccessible.

Basically, the network driver is buggy for this PHY setup. If a
MDIO bus contains devices that are not owned by the network device
owning that MDIO bus, then the MDIO bus _must_ be prepared to handle
MDIO bus accesses _at_ _any_ _time_. This clearly is not the case
here.

It could also be the case that if the driver is using runtime PM,
that when the network interface is runtime-PM suspended, it causes
MDIO bus accesses to fail... that would be very chaotic though.

In any case, I'm going to say... I don't think this is a phylink nor
phylib issue, but a buggy network driver thinking that it has the
right to shutdown MDIO bus access depending on its network interface
state.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!



More information about the linux-arm-kernel mailing list