Clocks vs. Power Domains circular dependency
Miquel Raynal
miquel.raynal at bootlin.com
Wed Jun 19 01:41:01 PDT 2024
Hello,
miquel.raynal at bootlin.com wrote on Mon, 27 May 2024 18:19:28 +0200:
> Hello all,
>
> I would like to re-open the clocks vs. runtime PM/power domains topic
> that is currently still problematic on i.MX8. I am mentioning a
> specific SoC here but the problem is hardware agnostic and any SoC with
> a sufficiently advanced power scheme may suffer from the same issues.
> When (mainline) Linux is running, we get lockdep warnings because of
> circular dependencies between the clock and power domain subsystems, as
> well as real deadlocks.
>
> I don't have the full history, but here is an interesting attempt at
> fixing this by Marek at the end of 2022:
> https://lore.kernel.org/all/Y1cs++TV2GCuh4tS@pendragon.ideasonboard.com/
> And here is an interesting discussion that happened on the mailing list
> following a second attempt from Marek as well:
> https://patchwork.kernel.org/project/linux-pm/patch/20221108013517.749665-1-marex@denx.de
>
> Looking more closely at this problem, there is indeed a software design
> issue. Both the clock and power domain subsystems are recursive and
> interdependent, the power domain subsystem being sometimes accessed
> through runtime PM calls.
>
> [TLDR: I'm drafting something, but issue #4 at the bottom is blocking]
I understand getting involved in this thread is a bit tiresome :-).
Still, I'd really like to get some feedback from the interested
parties, even if the feedback is not on the technical part; if this
issue could get a bit of support, it would already be a plus.
Also, I am really annoyed by issue #4 mentioned below, in case someone
wants^W has the willingness to look into it. In any case I might try
to go further myself.
Thanks!
Miquèl
(keeping the context below, just in case)
> ** THE ISSUE: FROM A CLOCK PERSPECTIVE **
>
> Clock changes (not only clk_prepare(), but getting/setting rates,
> re-parenting and so on) involve runtime resuming part of the clock
> tree. E.g. the registers for accessing/configuring a clock might be in
> a power domain that must be resumed first.
>
> Also, the clk subsystem is mostly written using recursion, so for
> instance preparing a clock works like this:
>
> /*
> * Same applies to the other clk consumer calls, but instead of going
> * through the parents, other parts of the clk tree are accessed (the
> * children subtree, the new parent tree, or both). So the prepare
> * situation is one of the simplest in this regard, but the following
> * basically applies to most consumer operations.
> */
> clk_prepare(clk) {
>         clk_prepare(clk->parent);
>         runtime_resume(clk->domain);
>         clk->ops->prepare();
> }
>
> Now let's add another level of complexity: accesses to the clock tree
> are serialized by the clk_prepare mutex, and runtime PM calls are also
> protected by a global genpd lock:
>
> clk_prepare(clk) {
>         mutex_lock(clk_prepare);
>         clk_prepare(clk->parent); /* clk_prepare is reentrant, so this is "fine" */
>         runtime_resume(clk->domain);
>         clk->ops->prepare();
>         mutex_unlock(clk_prepare);
> }
>
> runtime_resume(domain) {
>         mutex_lock(genpd);
>         domain->ops->resume();
>         mutex_unlock(genpd);
> }
>
> If we list the locking events in order, we get:
> - acquire the clk lock
> - acquire the genPD lock
> - release the genPD lock
> - release the clk lock
>
> ** THE ISSUE: FROM A POWER DOMAIN PERSPECTIVE **
>
> Now let's see the other side, from the power domain perspective. We can
> imagine several situations where these locks are acquired in the reverse
> order. From a hardware standpoint there is a prerequisite (met in the
> i.MX8 in several places): a power domain is fed by a clock, hence
> enabling the power domain also implies enabling the feeding clock.
>
> #1: Probing/enabling a peripheral involves attaching and powering up
> its power domain. If the power domain is fed by a clock, the power
> domain implementation will request clock changes.
>
> #2: Suspending devices happens asynchronously in a
> worker. Suspending a device implies turning off its power domain,
> and hence possibly reaching some clock internals as well.
>
> In these cases we will play with locks in this order:
> - acquire the genPD lock
> - acquire the clk lock
> - release the clk lock
> - release the genPD lock
>
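> To make the inversion concrete, here is a possible interleaving (a
> sketch; the exact call paths vary, but this is the shape of the
> circular dependency lockdep warns about):
>
>     CPU0 (clock operation)            CPU1 (genpd on/off)
>     mutex_lock(clk_prepare);
>                                       mutex_lock(genpd);
>     mutex_lock(genpd);    /* waits */
>                                       mutex_lock(clk_prepare); /* waits */
>
> Neither side can make progress anymore: a classic AB-BA deadlock.
>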
> Here are two typical examples of power domains playing with upstream
> clocks:
> https://elixir.bootlin.com/linux/latest/source/drivers/pmdomain/imx/imx8mp-blk-ctrl.c#L538
> https://elixir.bootlin.com/linux/latest/source/drivers/pmdomain/imx/gpcv2.c#L340
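>
> To picture what happens there, here is a minimal sketch of such a
> genpd ->power_on() handler (hypothetical structure and field names,
> loosely modeled on the drivers linked above; clk_bulk_prepare_enable()
> is the real API):
>
> /*
>  * Sketch: a power domain whose on/off sequence drives clocks. The
>  * genpd core calls ->power_on() with the domain lock held, and
>  * clk_bulk_prepare_enable() then takes the clk_prepare mutex:
>  * genpd lock -> clk lock, the reverse of the clock-side ordering.
>  */
> static int blk_ctrl_power_on(struct generic_pm_domain *genpd)
> {
>         /* to_blk_ctrl_domain() and struct blk_ctrl_domain are made up */
>         struct blk_ctrl_domain *dom = to_blk_ctrl_domain(genpd);
>
>         /* Takes the clk_prepare mutex under the genpd lock */
>         return clk_bulk_prepare_enable(dom->num_clks, dom->clks);
> }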
>
> ** CURRENT UNDERSTANDING **
>
> Clock changes and runtime PM calls can happen at any moment, and if
> your hardware is in a bad mood a deadlock *will* occur. I myself only
> experienced deadlocks involving the clock worker disabling unused clocks
> at the end of the boot, but chances are way too high that systems will
> lock up at some point in production environments because of this
> circular dependency between the clock and power domain locks. It is
> possible to avoid the deadlock I experienced by disabling the
> clk_disable_unused worker and keeping all clocks enabled, but we will
> all agree that this is a very sub-optimal and extremely unsatisfying
> solution for a SoC explicitly designed for optimized power consumption.
>
> There is no magic: we must disentangle the situation and prevent either
> one lock ordering or the other from happening. So we should ask
> ourselves: which lock order seems the most legitimate?
>
> #1: Do clocks really depend on power domains?
>
> -> Clock controllers may be part of a power domain, hence the power
> domain must be enabled in order to access the clock controller
> registers.
>
> -> Same applies if a clock is generated by an external device
> (eg. an audio codec) and this codec is accessible through a serial
> bus: the bus controller must be resumed in order to configure the
> codec, and the bus controller may be in its own power domain.
>
> #2: Do power domains really depend on clocks?
>
> -> Somehow, yes. Most power domains are not explicitly "driven" by
> the upstream clocks described in device trees, i.e. keeping these
> clocks gated does not prevent using or accessing the power domains
> per se. However, the power domains may redistribute the clocks to
> the peripherals inside them and somehow become clock controllers
> themselves (even though so far they are not flagged like that in
> any DTS). Hence it appears legitimate that they might want to play
> with the clock API.
>
> DTSI example:
> https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L1799
>
> TRM extract:
>
> 13.2.1 Overview
> The MEDIAMIX domain contains control and status registers known as
> Media Block Control (MEDIA BLK_CTRL). These registers have
> miscellaneous top-level controls for peripherals within the MEDIAMIX
> domain. Some of these controls includes clocking, [...]
>
> 13.2.2: Clocks and Resets
> The Media BLK_CTRL contains clock distribution and gating controls,
> as well as reset handling to several of the MEDIAMIX peripherals.
>
> -> One specificity of the i.MX8 is that many power domains are
> also described with a "clocks" property which lists clocks that are
> instead directly connected to the peripherals inside the power
> domain. The reason for this inaccurate description is that toggling
> a power domain on or off seems to involve propagating reset signals,
> which only happens correctly if a few specific clocks feeding the
> targeted peripherals are actually running. So, as part of the
> on/off sequence, these clocks must get enabled (and then disabled)
> earlier than in the peripherals' respective probe functions,
> see:
> - ae1d2add26a4 ("soc: imx: gpcv2: handle reset clocks")
> - https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L795
> - https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L857
>
> It is hard to tell, without more details from NXP, how difficult it
> would be to switch from the #2 "wrong" power domain DTS description to
> something more accurate and still get the proper init sequence. Then
> again, doing so might not be very relevant, as some power domains have
> real clock dependencies anyway, as expressed in the SoC manual quoted
> above.
>
> #1 however, despite being a physical constraint, seems also tightly
> related to the software design: runtime PM calls were introduced in the
> clock subsystem initially for two reasons:
> - The clock is itself in an area of the SoC that needs to be powered on
> before accessing its registers.
> - The above situation also applies to all the possible parents of the
> clock.
> See: 9a34b45397e5 ("clk: Add support for runtime PM")
> https://lore.kernel.org/all/1503302703-13801-2-git-send-email-m.szyprowski@samsung.com
> The clock core being highly recursive, it was way easier (and also
> probably cleaner) to include nested runtime PM calls like this. But in
> practice, we could imagine moving the resume path outside of the clock
> lock and thus avoiding, by software design, the conditions under which
> the two locks nest.
>
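> For reference, the runtime PM hook inside the clock core is roughly
> the following (simplified from drivers/clk/clk.c, the exact shape may
> vary across kernel versions); it is called from the recursive paths
> with the prepare lock already held:
>
> static int clk_pm_runtime_get(struct clk_core *core)
> {
>         if (!core->rpm_enabled)
>                 return 0;
>
>         return pm_runtime_resume_and_get(core->dev);
> }
>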
> Note: These suggestions have already been made by Ulf:
> https://patchwork.kernel.org/project/linux-pm/patch/20221108013517.749665-1-marex@denx.de/#25092554
>
> ** APPROACH UNDER EVALUATION **
>
> As hinted above, maybe we should clarify the conditions under which RPM
> calls can be made in the clock subsystem and, instead of including the
> RPM calls within the clock sub-calls, consider, on a case-by-case
> basis:
> - first, listing what needs to be resumed for the whole clock operation
> to succeed,
> - then, runtime resuming the needed domains outside of the main
> subsystem lock,
> - and only then continuing with the clock operations, eventually
> acquiring the main clk_prepare lock.
> IOW, avoiding runtime PM calls while the subsystem lock is held.
>
> Let's imagine the following situation: clkA (in power domain A) is
> parent of clkB (in power domain B). Like above, let's try to picture the
> call stack:
>
> clk_prepare(clkB) {
>         mutex_lock(clk_prepare);
>         clkA = clkB->parent;
>         clk_prepare(clkA) {
>                 mutex_lock(clk_prepare); /* fine, it's recursive */
>                 runtime_resume(clkA->domain); // BANG
>                 clkA->ops->prepare();
>                 mutex_unlock(clk_prepare); /* does nothing */
>         }
>         runtime_resume(clkB->domain); // BANG again
>         clkB->ops->prepare();
>         mutex_unlock(clk_prepare);
> }
>
> Now, let's imagine we know what clocks are going to be accessed and thus
> need to be runtime resumed. Maybe we could do that in advance. The logic
> could become:
>
>   clk_prepare(clkB) {
> +         list = list_useful_clocks_for_prepare(clkB);
> +         runtime_resume(list);
>           mutex_lock(clk_prepare);
>           clkA = clkB->parent;
>           clk_prepare(clkA) {
>                   mutex_lock(clk_prepare); /* fine, it's recursive */
> -                 runtime_resume(clkA->domain);
>                   clkA->ops->prepare();
>                   mutex_unlock(clk_prepare); /* does nothing */
>           }
> -         runtime_resume(clkB->domain);
>           clkB->ops->prepare();
>           mutex_unlock(clk_prepare);
>   }
>
> In the above example, "list_useful_clocks_for_prepare()" is a bit
> mysterious and has two aspects. First, it needs to guess what clocks
> are needed: this is addressed in the next paragraph. Second, we need
> to make sure the tree does not change until the lock is acquired: this
> is addressed in the "#1: Clock tree changes" chapter below.
>
> So, regarding the list itself, any operation on any clock in the tree
> involves said clock plus one or several clock "lists" among the
> following choices:
>
> A- All its parents, through the clk_core->parent link.
> => Typically during prepare/unprepare operations (see the sketch
> after this list).
>
> B- All its children and sub-children, through all the sub
> clk_core->children hlists.
> => Typically during rate operations (get and set, see note below).
>
> C- Parents and children of the new parent.
> => In case of re-parenting, explicitly asked for or as a result of a
> rate change.
>
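> For case A, here is a minimal sketch of what the pre-resume helper
> could look like (the helper itself is hypothetical; clk_core's parent
> and dev fields and pm_runtime_resume_and_get() are existing kernel
> bits):
>
> /*
>  * Hypothetical: resume the domains of @core and of all its parents
>  * *before* taking the clk_prepare mutex. Dropping the references
>  * already taken on error is omitted for brevity.
>  */
> static int clk_resume_parents(struct clk_core *core)
> {
>         struct clk_core *c;
>         int ret;
>
>         for (c = core; c; c = c->parent) {
>                 if (!c->dev)
>                         continue;
>                 ret = pm_runtime_resume_and_get(c->dev);
>                 if (ret < 0)
>                         return ret;
>         }
>
>         return 0;
> }
>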
> Side note: the clock core does not always remember the rate of the
> clocks that were already configured. TBH I did not understand why,
> because it seems trivial to do. Instead the core physically accesses
> the clock controller, which involves a lot of runtime PM handling. In
> practice this is not always the case, but some clock controller drivers
> request to be "asked" for the actual rate each time (the
> CLK_GET_RATE_NOCACHE flag). Problem is, they re-calculate the rate and
> then also propagate the result across all the children, which involves
> massive amounts of (IMHO) useless wake-ups. This honestly feels a bit
> sub-optimal, without a better understanding of what it is for.
>
> ** DRAWBACKS OF THIS APPROACH **
>
> #1: Clock tree changes
>
> Of course there was a point in making the runtime resume calls inside
> the clk_prepare lock: the lock protects against clk tree changes. But
> somehow this is still manageable, we could:
> - acquire the lock
> - parse the relevant clocks, take a reference on each of them and save
> them in a list
> - release the lock
> - runtime resume the clocks in the list
> - acquire the lock again
> - compare the actual tree with the list and retry the whole operation
> if something changed (sketched below).
>
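> In the same pseudocode style as above, the validate-and-retry pattern
> could look like this (the generation counter and helper names are
> hypothetical):
>
> retry:
>         mutex_lock(clk_prepare);
>         gen = clk_tree_generation;           /* hypothetical counter */
>         list = collect_relevant_clocks(clk); /* takes references */
>         mutex_unlock(clk_prepare);
>
>         runtime_resume(list);
>
>         mutex_lock(clk_prepare);
>         if (gen != clk_tree_generation) { /* tree changed under us */
>                 mutex_unlock(clk_prepare);
>                 runtime_suspend(list);    /* drop the references */
>                 goto retry;
>         }
>         /* ... perform the clock operation with the lock held ... */
>         mutex_unlock(clk_prepare);
>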
> Estimated status: manageable.
>
> #2: Re-Parenting
>
> When performing almost any clock operation, we already know in advance
> which clocks in the tree will be impacted and can thus precisely tell
> what needs to be runtime resumed beforehand. But this does not work
> when re-parenting is involved, as the re-parenting opportunity is only
> discovered very deep in the call stack, far beyond acquiring the main
> lock.
>
> One idea to work around the situation is to somehow save the new parent
> handle, propagate a specific error (typically -EAGAIN) up through the
> call stack to a point where we can process it, perform a new round of
> runtime resume calls over the new parent tree and retry the operation
> (sketched below).
>
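> In the same pseudocode style, the retry could look like this (names
> are hypothetical):
>
> clk_set_rate(clk, rate) {
>         new_parent = NULL;
> again:
>         runtime_resume(list_useful_clocks_for_rate(clk, new_parent));
>         mutex_lock(clk_prepare);
>         ret = do_set_rate(clk, rate, &new_parent);
>         mutex_unlock(clk_prepare);
>         /* -EAGAIN: a deeper call picked a parent we had not resumed */
>         if (ret == -EAGAIN)
>                 goto again;
> }
>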
> Estimated status: manageable.
>
> #3: Debug dumps might slightly get out of sync
>
> There are a couple of situations where all the clocks (or all the
> orphan clocks, or the root clocks) will be "shown/dumped" to the user.
> Problem is: it is fine to show all the parents/children of a single
> clock (typically all the children of one root clock), but it would
> involve a lot of code duplication to cover e.g. all the root clocks at
> the same time. If we decide that this is okay, we can iterate over the
> root clocks, releasing the lock between them, which is obviously racy.
>
> Estimated status: probably manageable, maybe not so important.
>
> #4: Clock core reentrancy and unrelated clocks
>
> This is the main drawback with this approach, and that is what
> motivated this letter in the first place.
>
> The clock core is reentrant. At first I thought this was only useful
> to simplify the locking mechanism inside the core itself, in particular
> because the core is mainly built from recursive functions, but there
> are two actual use cases that are much harder to solve. They are
> mentioned in the commit introducing the reentrance mechanism:
> 533ddeb1e86f ("clk: allow reentrant calls into the clk framework")
>
> Basically, besides simplifying the locking mechanism in the core itself:
>
> A: Enabling a clock may require enabling a "totally unrelated" peripheral
> first. Typically, an I2C audio codec that needs to be configured
> in order to produce a clock. In the ->prepare() hook implementation,
> one can expect some kind of i2c_xfer() call, which in turn will
> runtime resume the I2C controller used to access the codec. So we end
> up acquiring the power domain lock again with the clk_prepare lock
> taken (see the sketch after this list).
>
> B: There is one clock driver (tegra/clk-dfll.c) which performs pinctrl
> changes inside clock operations. If the platform is runtime PM
> enabled, the pin controller may require being runtime resumed and
> once again we might fall into the same reentrancy issue.
>
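> To picture case A, here is a minimal sketch of such a ->prepare()
> hook (hypothetical codec driver; the point is that a regmap access
> over I2C runtime-resumes the I2C controller under the hood):
>
> /*
>  * Hypothetical codec clock driver: ->prepare() is entered with the
>  * clk_prepare mutex held, and the regmap/I2C access below runtime
>  * resumes the I2C controller, taking the genpd lock again.
>  */
> static int codec_clk_prepare(struct clk_hw *hw)
> {
>         /* to_codec_clk(), struct codec_clk and the register are made up */
>         struct codec_clk *priv = to_codec_clk(hw);
>
>         return regmap_set_bits(priv->regmap, CODEC_CLK_CTRL, CODEC_CLK_EN);
> }
>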
> I don't see any positive outcome in trying to guess or discover these
> cases dynamically, and if we want to address them I believe some kind
> of static declaration must be provided by the clock controller drivers
> (and trying to do that now will inevitably lead to regressions if we
> miss a single case).
>
> Estimated status: problematic.
>
> ** CONCLUSION AND QUESTIONS **
>
> I've actually drafted the idea explained above. More work is needed to
> get it up to a working state, as the whole clock core must be converted
> to this logic in order for it to work. I believe this approach is
> relevant, but it still has blind spots we need to figure out (#4 above),
> for which I would like to get some feedback.
>
> Maybe however this approach will not be acknowledged and another
> direction will be pointed out. In that case I would be pleased to
> discuss it and see what we can do. In all cases, the i.MX8 issue is
> real and must be fixed. If we don't do it properly, soon another
> power-oriented SoC will show similar defects. Basically, any Linux
> based operating system running on hardware with a similar power scheme
> is subject to deadlock at any moment.
>
> I hope this report will be precise enough and will raise ideas to help
> going forward and finally address, in a rather satisfying way, this
> issue that has been open since 2022 (at least).
>
> Thanks,
> Miquèl
>
> PS: There may be other ways. Of course I had to pick the one that
> felt the most promising, even though I initially considered a couple of
> other alternatives:
> * Addressing the issue in the power domain subsystem? It seemed less
> relevant than in the clock core, because of the physical
> dependencies. But please convince me :-)
> * I saw comments about changing the clk_prepare lock semantics, but we
> will always need to serialize accesses to the clock tree, and after
> digging a bit in the clock core I don't see any way out with this
> solution.
> * Preventing power domains from gathering clocks? Also a possible
> approach maybe, even though the situations listed above are far from
> straightforward to address in this case.