Clocks vs. Power Domains circular dependency
Miquel Raynal
miquel.raynal at bootlin.com
Wed Jun 19 01:41:01 PDT 2024
Hello,
miquel.raynal at bootlin.com wrote on Mon, 27 May 2024 18:19:28 +0200:
> Hello all,
>
> I would like to re-open the clocks vs. runtime PM/power domains topic
> that is currently still problematic on i.MX8. I am mentioning a
> specific SoC here but the problem is hardware agnostic and any SoC with
> a sufficiently advanced power scheme may suffer from the same issues.
> When (mainline) Linux is running, we get lockdep warnings because of
> circular dependencies between the clock and power domain subsystems, as
> well as real deadlocks.
>
> I don't have the full history, but here is an interesting attempt at
> fixing this by Marek at the end of 2022:
> https://lore.kernel.org/all/Y1cs++TV2GCuh4tS@pendragon.ideasonboard.com/
> And here is an interesting discussion that happened on the mailing list
> following a second attempt from Marek as well:
> https://patchwork.kernel.org/project/linux-pm/patch/20221108013517.749665-1-marex@denx.de
>
> Looking more closely at this problem, there is indeed a software design
> issue. Both the clock and power domain subsystems are recursive and
> interdependent, the power domain subsystem being sometimes accessed
> through runtime PM calls.
>
> [TLDR: I'm drafting something, but issue #4 at the bottom is blocking]
I understand getting involved in this thread is a bit tiresome :-).
Still, I'd really like to get some feedback from the interested
parties, even if the feedback is not on the technical part; if this
issue could get a bit of support, it would already be a plus.
Also, I am really annoyed by issue #4 mentioned below, in case someone
wants^W has the willingness to look into it. In any case I might try
to go further myself.
Thanks!
Miquèl
(keeping the context below, just in case)
> ** THE ISSUE: FROM A CLOCK PERSPECTIVE **
>
> Clock changes (not only clk_prepare(), but getting/setting rates,
> re-parenting and so on) involve runtime resuming part of the clock
> tree. E.g. the registers for accessing/configuring a clock might be in
> a power domain that must be resumed first.
>
> Also, the clk subsystem is mostly written using recursion, so for
> instance preparing a clock works like this:
>
> /*
> * Same applies to the other clk consumer calls, but instead of going
> * through the parents, other parts of the clk tree are accessed (the
> * children subtree, the new parent tree, or both). So the prepare
> * situation is one of the simplest in this regard, but the following
> * basically applies to most consumer operations.
> */
> clk_prepare(clk) {
>         clk_prepare(clk->parent);
>         runtime_resume(clk->domain);
>         clk->ops->prepare();
> }
>
> Now let's add another level of complexity: accesses to the clock tree
> are serialized by the clk_prepare mutex, and runtime PM calls are also
> protected by a global genpd lock:
>
> clk_prepare(clk) {
>         mutex_lock(clk_prepare);
>         clk_prepare(clk->parent); /* clk_prepare is reentrant, so this is "fine" */
>         runtime_resume(clk->domain);
>         clk->ops->prepare();
>         mutex_unlock(clk_prepare);
> }
>
> runtime_resume(domain) {
>         mutex_lock(genpd);
>         domain->ops->resume();
>         mutex_unlock(genpd);
> }
>
> If we list the locking events in order, we get:
> - acquire the clk lock
> - acquire the genPD lock
> - release the genPD lock
> - release the clk lock
>
> ** THE ISSUE: FROM A POWER DOMAIN PERSPECTIVE **
>
> Now let's see the other side, from the power domain perspective. We can
> imagine several situations where these locks are acquired in the reverse
> order. From a hardware standpoint there is a prerequisite (met in the
> i.MX8 in several places): a power domain is fed by a clock, hence
> enabling the power domain also implies enabling the feeding clock.
>
> #1: Probing/enabling a peripheral involves attaching and powering up
> its power domain. If the power domain is fed by a clock, the power
> domain implementation will request clock changes.
>
> #2: Suspending devices happens asynchronously in a
> worker. Suspending a device implies turning off its power domain,
> and hence possibly reaching some clock internals as well.
>
> In these cases we will play with locks in this order:
> - acquire the genPD lock
> - acquire the clk lock
> - release the clk lock
> - release the genPD lock
>
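> To make the inversion concrete, here is a possible interleaving (a
> sketch; the exact call paths vary, but this is the shape of the
> circular dependency lockdep warns about):
>
>     CPU0 (clock operation)            CPU1 (genpd on/off)
>     mutex_lock(clk_prepare);
>                                       mutex_lock(genpd);
>     mutex_lock(genpd);    /* waits */
>                                       mutex_lock(clk_prepare); /* waits */
>
> Neither side can make progress anymore: a classic AB-BA deadlock.
>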
> Here are two typical examples of power domains playing with upstream
> clocks:
> https://elixir.bootlin.com/linux/latest/source/drivers/pmdomain/imx/imx8mp-blk-ctrl.c#L538
> https://elixir.bootlin.com/linux/latest/source/drivers/pmdomain/imx/gpcv2.c#L340
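>
> To picture what happens there, here is a minimal sketch of such a
> genpd ->power_on() handler (hypothetical structure and field names,
> loosely modeled on the drivers linked above; clk_bulk_prepare_enable()
> is the real API):
>
> /*
>  * Sketch: a power domain whose on/off sequence drives clocks. The
>  * genpd core calls ->power_on() with the domain lock held, and
>  * clk_bulk_prepare_enable() then takes the clk_prepare mutex:
>  * genpd lock -> clk lock, the reverse of the clock-side ordering.
>  */
> static int blk_ctrl_power_on(struct generic_pm_domain *genpd)
> {
>         /* to_blk_ctrl_domain() and struct blk_ctrl_domain are made up */
>         struct blk_ctrl_domain *dom = to_blk_ctrl_domain(genpd);
>
>         /* Takes the clk_prepare mutex under the genpd lock */
>         return clk_bulk_prepare_enable(dom->num_clks, dom->clks);
> }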
>
> ** CURRENT UNDERSTANDING **
>
> Clock changes and runtime PM calls can happen at any moment, and if
> your hardware is in a bad mood a deadlock *will* occur. I myself only
> experienced deadlocks involving the clock worker disabling unused clocks
> at the end of the boot, but chances are way too high that systems will
> lock up at some point in production environments because of this
> circular dependency between the clock and power domain locks. It is
> possible to avoid the deadlock I experienced by disabling the
> clk_disable_unused worker and keeping all clocks enabled, but we will
> all agree that this is a very sub-optimal and extremely unsatisfying
> solution for a SoC explicitly designed for optimized power consumption.
>
> There is no magic: we must disentangle the situation and prevent either
> one lock ordering or the other from happening. So we should ask
> ourselves: which lock order seems the most legitimate?
>
> #1: Do clocks really depend on power domains?
>
> -> Clock controllers may be part of a power domain, hence the power
> domain must be enabled in order to access the clock controller
> registers.
>
> -> Same applies if a clock is generated by an external device
> (eg. an audio codec) and this codec is accessible through a serial
> bus: the bus controller must be resumed in order to configure the
> codec, and the bus controller may be in its own power domain.
>
> #2: Do power domains really depend on clocks?
>
> -> Somehow, yes. Most power domains are not explicitly "driven" by
> the upstream clocks described in device trees, i.e. keeping these
> clocks gated does not prevent using or accessing the power domains
> per se. However, the power domains may redistribute the clocks to
> the peripherals inside them and somehow become clock controllers
> themselves (even though so far they are not flagged like that in
> any DTS). Hence it appears legitimate that they might want to play
> with the clock API.
>
> DTSI example:
> https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L1799
>
> TRM extract:
>
> 13.2.1 Overview
> The MEDIAMIX domain contains control and status registers known as
> Media Block Control (MEDIA BLK_CTRL). These registers have
> miscellaneous top-level controls for peripherals within the MEDIAMIX
> domain. Some of these controls includes clocking, [...]
>
> 13.2.2: Clocks and Resets
> The Media BLK_CTRL contains clock distribution and gating controls,
> as well as reset handling to several of the MEDIAMIX peripherals.
>
> -> One specificity of the i.MX8 is that many power domains are
> also described with a "clocks" property which lists clocks that are
> instead directly connected to the peripherals inside the power
> domain. The reason for this inaccurate description is that toggling
> a power domain on or off seems to involve propagating reset signals,
> which only happens correctly if a few specific clocks feeding the
> targeted peripherals are actually running. So, as part of the
> on/off sequence, these clocks must get enabled (and then disabled)
> earlier than in the peripherals' respective probe functions,
> see:
> - ae1d2add26a4 ("soc: imx: gpcv2: handle reset clocks")
> - https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L795
> - https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L857
>
> It is hard to tell, without more details from NXP, how difficult it
> would be to switch from the #2 "wrong" power domain DTS description to
> something more accurate and still get the proper init sequence. Then
> again, doing so might not be very relevant, as some power domains have
> real clock dependencies anyway, as expressed in the SoC manual quoted
> above.
>
> #1 however, despite being a physical constraint, seems also tightly
> related to the software design: runtime PM calls were introduced in the
> clock subsystem initially for two reasons:
> - The clock is itself in an area of the SoC that needs to be powered on
> before accessing its registers.
> - The above situation also applies to all the possible parents of the
> clock.
> See: 9a34b45397e5 ("clk: Add support for runtime PM")
> https://lore.kernel.org/all/1503302703-13801-2-git-send-email-m.szyprowski@samsung.com
> The clock core being highly recursive, it was way easier (and also
> probably cleaner) to include nested runtime PM calls like this. But in
> practice, we could imagine moving the resume path outside of the clock
> lock and thus avoiding, by software design, the conditions under which
> the two locks nest.
>
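> For reference, the runtime PM hook inside the clock core is roughly
> the following (simplified from drivers/clk/clk.c, the exact shape may
> vary across kernel versions); it is called from the recursive paths
> with the prepare lock already held:
>
> static int clk_pm_runtime_get(struct clk_core *core)
> {
>         if (!core->rpm_enabled)
>                 return 0;
>
>         return pm_runtime_resume_and_get(core->dev);
> }
>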
> Note: These suggestions have already been made by Ulf:
> https://patchwork.kernel.org/project/linux-pm/patch/20221108013517.749665-1-marex@denx.de/#25092554
>
> ** APPROACH UNDER EVALUATION **
>
> As hinted above, maybe we should clarify the conditions under which RPM
> calls can be made in the clock subsystem and, instead of including the
> RPM calls within the clock sub-calls, consider, on a case-by-case
> basis:
> - first, listing what needs to be resumed for the whole clock operation
> to succeed,
> - then, runtime resuming the needed domains outside of the main
> subsystem lock,
> - and only then continuing with the clock operations, eventually
> acquiring the main clk_prepare lock.
> IOW, avoiding runtime PM calls while the subsystem lock is held.
>
> Let's imagine the following situation: clkA (in power domain A) is
> parent of clkB (in power domain B). Like above, let's try to picture the
> call stack:
>
> clk_prepare(clkB) {
>         mutex_lock(clk_prepare);
>         clkA = clkB->parent;
>         clk_prepare(clkA) {
>                 mutex_lock(clk_prepare); /* fine, it's recursive */
>                 runtime_resume(clkA->domain); // BANG
>                 clkA->ops->prepare();
>                 mutex_unlock(clk_prepare); /* does nothing */
>         }
>         runtime_resume(clkB->domain); // BANG again
>         clkB->ops->prepare();
>         mutex_unlock(clk_prepare);
> }
>
> Now, let's imagine we know what clocks are going to be accessed and thus
> need to be runtime resumed. Maybe we could do that in advance. The logic
> could become:
>
>   clk_prepare(clkB) {
> +         list = list_useful_clocks_for_prepare(clkB);
> +         runtime_resume(list);
>           mutex_lock(clk_prepare);
>           clkA = clkB->parent;
>           clk_prepare(clkA) {
>                   mutex_lock(clk_prepare); /* fine, it's recursive */
> -                 runtime_resume(clkA->domain);
>                   clkA->ops->prepare();
>                   mutex_unlock(clk_prepare); /* does nothing */
>           }
> -         runtime_resume(clkB->domain);
>           clkB->ops->prepare();
>           mutex_unlock(clk_prepare);
>   }
>
> In the above example, "list_useful_clocks_for_prepare()" is a bit
> mysterious and has two aspects. First, it needs to guess what clocks
> are needed: this is addressed in the next paragraph. Second, we need
> to make sure the tree does not change until the lock is acquired: this
> is addressed in the "#1: Clock tree changes" chapter below.
>
> So, regarding the list itself, any operation on any clock in the tree
> involves said clock plus one or several clock "lists" among the
> following choices:
>
> A- All its parents, through the clk_core->parent link.
> => Typically during prepare/unprepare operations (see the sketch
> after this list).
>
> B- All its children and sub-children, through all the sub
> clk_core->children hlists.
> => Typically during rate operations (get and set, see note below).
>
> C- Parents and children of the new parent.
> => In case of re-parenting, explicitly asked for or as a result of a
> rate change.
>
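> For case A, here is a minimal sketch of what the pre-resume helper
> could look like (the helper itself is hypothetical; clk_core's parent
> and dev fields and pm_runtime_resume_and_get() are existing kernel
> bits):
>
> /*
>  * Hypothetical: resume the domains of @core and of all its parents
>  * *before* taking the clk_prepare mutex. Dropping the references
>  * already taken on error is omitted for brevity.
>  */
> static int clk_resume_parents(struct clk_core *core)
> {
>         struct clk_core *c;
>         int ret;
>
>         for (c = core; c; c = c->parent) {
>                 if (!c->dev)
>                         continue;
>                 ret = pm_runtime_resume_and_get(c->dev);
>                 if (ret < 0)
>                         return ret;
>         }
>
>         return 0;
> }
>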
> Side note: the clock core does not always remember the rate of the
> clocks that were already configured. TBH I did not understand why,
> because it seems trivial to do. Instead the core physically accesses
> the clock controller, which involves a lot of runtime PM handling. In
> practice this is not always the case, but some clock controller drivers
> request to be "asked" for the actual rate each time (the
> CLK_GET_RATE_NOCACHE flag). Problem is, they re-calculate the rate and
> then also propagate the result across all the children, which involves
> massive amounts of (IMHO) useless wake-ups. This honestly feels a bit
> sub-optimal, without a better understanding of what it is for.
>
> ** DRAWBACKS OF THIS APPROACH **
>
> #1: Clock tree changes
>
> Of course there was a point in making the runtime resume calls inside
> the clk_prepare lock: the lock protects against clk tree changes. But
> somehow this is still manageable, we could:
> - acquire the lock
> - parse the relevant clocks, take a reference on each of them and save
> them in a list
> - release the lock
> - runtime resume the clocks in the list
> - acquire the lock again
> - compare the actual tree with the list and retry the whole operation
> if something changed (sketched below).
>
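> In the same pseudocode style as above, the validate-and-retry pattern
> could look like this (the generation counter and helper names are
> hypothetical):
>
> retry:
>         mutex_lock(clk_prepare);
>         gen = clk_tree_generation;           /* hypothetical counter */
>         list = collect_relevant_clocks(clk); /* takes references */
>         mutex_unlock(clk_prepare);
>
>         runtime_resume(list);
>
>         mutex_lock(clk_prepare);
>         if (gen != clk_tree_generation) { /* tree changed under us */
>                 mutex_unlock(clk_prepare);
>                 runtime_suspend(list);    /* drop the references */
>                 goto retry;
>         }
>         /* ... perform the clock operation with the lock held ... */
>         mutex_unlock(clk_prepare);
>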
> Estimated status: manageable.
>
> #2: Re-Parenting
>
> When performing almost any clock operation, we already know in advance
> which clocks in the tree will be impacted and can thus precisely tell
> what needs to be runtime resumed beforehand. But this does not work
> when re-parenting is involved, as the re-parenting opportunity is only
> discovered very deep in the call stack, far beyond acquiring the main
> lock.
>
> One idea to work around the situation is to somehow save the new parent
> handle, propagate a specific error (typically -EAGAIN) up through the
> call stack to a point where we can process it, perform a new round of
> runtime resume calls over the new parent tree and retry the operation
> (sketched below).
>
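> In the same pseudocode style, the retry could look like this (names
> are hypothetical):
>
> clk_set_rate(clk, rate) {
>         new_parent = NULL;
> again:
>         runtime_resume(list_useful_clocks_for_rate(clk, new_parent));
>         mutex_lock(clk_prepare);
>         ret = do_set_rate(clk, rate, &new_parent);
>         mutex_unlock(clk_prepare);
>         /* -EAGAIN: a deeper call picked a parent we had not resumed */
>         if (ret == -EAGAIN)
>                 goto again;
> }
>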
> Estimated status: manageable.
>
> #3: Debug dumps might slightly get out of sync
>
> There are a couple of situations where all the clocks (or all the
> orphan clocks, or the root clocks) will be "shown/dumped" to the user.
> Problem is: it is fine to show all the parents/children of a single
> clock (typically all the children of one root clock), but it would
> involve a lot of code duplication to cover e.g. all the root clocks at
> the same time. If we decide that this is okay, we can iterate over the
> root clocks, releasing the lock between them, which is obviously racy.
>
> Estimated status: probably manageable, maybe not so important.
>
> #4: Clock core reentrancy and unrelated clocks
>
> This is the main drawback with this approach, and that is what
> motivated this letter in the first place.
>
> The clock core is reentrant. At first I thought this was only useful
> to simplify the locking mechanism inside the core itself, in particular
> because the core is mainly built from recursive functions, but there
> are two actual use cases that are much harder to solve. They are
> mentioned in the commit introducing the reentrance mechanism:
> 533ddeb1e86f ("clk: allow reentrant calls into the clk framework")
>
> Basically, besides simplifying the locking mechanism in the core itself:
>
> A: Enabling a clock may require enabling a "totally unrelated" peripheral
> first. Typically, an I2C audio codec that needs to be configured
> in order to produce a clock. In the ->prepare() hook implementation,
> one can expect some kind of i2c_xfer() call, which in turn will
> runtime resume the I2C controller used to access the codec. So we end
> up acquiring the power domain lock again with the clk_prepare lock
> taken (see the sketch after this list).
>
> B: There is one clock driver (tegra/clk-dfll.c) which performs pinctrl
> changes inside clock operations. If the platform is runtime PM
> enabled, the pin controller may require being runtime resumed and
> once again we might fall into the same reentrancy issue.
>
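> To picture case A, here is a minimal sketch of such a ->prepare()
> hook (hypothetical codec driver; the point is that a regmap access
> over I2C runtime-resumes the I2C controller under the hood):
>
> /*
>  * Hypothetical codec clock driver: ->prepare() is entered with the
>  * clk_prepare mutex held, and the regmap/I2C access below runtime
>  * resumes the I2C controller, taking the genpd lock again.
>  */
> static int codec_clk_prepare(struct clk_hw *hw)
> {
>         /* to_codec_clk(), struct codec_clk and the register are made up */
>         struct codec_clk *priv = to_codec_clk(hw);
>
>         return regmap_set_bits(priv->regmap, CODEC_CLK_CTRL, CODEC_CLK_EN);
> }
>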
> I don't see any positive outcome in trying to guess or discover these
> cases dynamically, and if we want to address them I believe some kind
> of static declaration must be provided by the clock controller drivers
> (and trying to do that now will inevitably lead to regressions if we
> miss a single case).
>
> Estimated status: problematic.
>
> ** CONCLUSION AND QUESTIONS **
>
> I've actually drafted the idea explained above. More work is needed to
> get it up to a working state, as the whole clock core must be converted
> to this logic in order for it to work. I believe this approach is
> relevant, but it still has blind spots we need to figure out (#4 above),
> for which I would like to get some feedback.
>
> Maybe however this approach will not be acknowledged and another
> direction will be pointed out. In that case I would be pleased to
> discuss it and see what we can do. In all cases, the i.MX8 issue is
> real and must be fixed. If we don't do it properly, soon another
> power-oriented SoC will show similar defects. Basically, any Linux
> based operating system running on hardware with a similar power scheme
> is subject to deadlock at any moment.
>
> I hope this report will be precise enough and will raise ideas to help
> going forward and finally address, in a rather satisfying way, this
> issue that has been open since 2022 (at least).
>
> Thanks,
> Miquèl
>
> PS: There may be other ways. Of course I had to pick the one that
> felt the most promising, even though I initially considered a couple of
> other alternatives:
> * Addressing the issue in the power domain subsystem? It seemed less
> relevant than in the clock core, because of the physical
> dependencies. But please convince me :-)
> * I saw comments about changing the clk_prepare lock semantics, but we
> will always need to serialize accesses to the clock tree, and after
> digging a bit in the clock core I don't see any way out with this
> solution.
> * Preventing power domains from gathering clocks? Also a possible
> approach maybe, even though the situations listed above are far from
> straightforward to address in this case.