[PATCH] pmdomain: mediatek: fix race condition in power on/power off sequences
AngeloGioacchino Del Regno
angelogioacchino.delregno at collabora.com
Wed Nov 29 06:41:04 PST 2023
Il 29/11/23 14:48, Eugen Hristev ha scritto:
> On 11/29/23 15:37, AngeloGioacchino Del Regno wrote:
>> Il 29/11/23 14:28, Eugen Hristev ha scritto:
>>> On 11/29/23 14:52, AngeloGioacchino Del Regno wrote:
>>>> Il 29/11/23 12:31, Eugen Hristev ha scritto:
>>>>> It can happen that during the power off sequence for a power domain
>>>>> another power on sequence is started, and it can lead to powering on and
>>>>> off in the same time for the similar power domain.
>>>>> This can happen if parallel probing occurs: one device starts probing, and
>>>>> one power domain is probe deferred, this leads to all power domains being
>>>>> rolled back and powered off, while in the same time another device starts
>>>>> probing and requests powering on the same power domains or similar.
>>>>>
>>>>> This was encountered on MT8186, when the sequence is :
>>>>> Power on SSUSB
>>>>> Power on SSUSB_P1
>>>>> Power on DIS
>>>>> -> probe deferred
>>>>> Power off DIS
>>>>> Power off SSUSB_P1
>>>>> Power off SSUSB
>>>>>
>>>>> During the sequence of powering off SSUSB, some new similar sequence starts,
>>>>> and during the power on of SSUSB, clocks are enabled.
>>>>> In this case, powering off SSUSB fails from the first sequence, because
>>>>> power off ACK bit check times out (as clocks are powered back on by the second
>>>>> sequence). In consequence, powering it on also times out, and it leads to
>>>>> the whole power domain in a bad state.
>>>>>
>>>>> To solve this issue, added a mutex that locks the whole power off/power on
>>>>> sequence such that it would never happen that multiple sequences try to
>>>>> enable or disable the same power domain in parallel.
>>>>>
>>>>> Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power domains")
>>>>> Signed-off-by: Eugen Hristev <eugen.hristev at collabora.com>
>>>>
>>>> I don't think that it's a race between genpd_power_on() and genpd_power_off()
>>>> calls
>>>> at all, because genpd *does* have locking after all... at least for probe and for
>>>> parents of a power domain (and more anyway).
>>>>
>>>> As far as I remember, what happens when you start .probe()'ing a device is:
>>>> platform_probe() -> dev_pm_domain_attach() -> genpd_dev_pm_attach()
>>>>
>>>> There, you end up with
>>>>
>>>> if (power_on) {
>>>> genpd_lock(pd);
>>>> ret = genpd_power_on(pd, 0);
>>>> genpd_unlock(pd);
>>>> }
>>>>
>>>> ...but when you fail probing, you go with genpd_dev_pm_detach(), which then calls
>>>>
>>>> /* Check if PM domain can be powered off after removing this device. */
>>>> genpd_queue_power_off_work(pd);
>>>>
>>>> but even then, you end up being in a worker doing
>>>>
>>>> genpd_lock(genpd);
>>>> genpd_power_off(genpd, false, 0);
>>>> genpd_unlock(genpd);
>>>>
>>>> ...so I don't understand why this mutex can resolve the situation here (also: are
>>>> you really sure that the race is solved like that?)
>>>>
>>>> I'd say that this probably needs more justification and a trace of the actual
>>>> situation here.
>>>>
>>>> Besides, if this really resolves the issue, I would prefer seeing variants of
>>>> scpsys_power_{on,off}() functions, because we anyway don't need to lock mutexes
>>>> during this driver's probe (add_subdomain calls scpsys_power_on()).
>>>> In that case, `scpsys_power_on_unlocked()` would be an idea... but still, please
>>>> analyze why your solution works, if it does, because I'm not convinced.
>>>
>>> What I see in my tests, is that a power on call for SSUSB domain happens while
>>> the previous power off sequence did not yet complete, most likely while it's
>>> waiting in readx_poll_timeout . This leads to inconsistency of the power domain,
>>> not getting the ACKs next time a power on attempt occurs.
>>>
>>> I understand what you say about locks, but in this case the powering off is not
>>> called by the genpd itself, but rather it's called by the rollback probe failed
>>> mechanism : when the probing fails, scpsys_domain_cleanup() is called during the
>>> same probing session.
>>> Then it happens that probing begins again and previous cleanup is not yet
>>> completed. I am not sure whether the lock is still held from the previous run,
>>> but it's clearly not waiting for a lock to be released to be called again.
>>>
>>
>> Sorry but I'm a bit lost now: is the problem about probe deferrals of the USB
>> driver, or about probe deferrals of the mtk-pm-domains driver?
>>
>> scpsys_domain_cleanup() is only called upon scpsys_probe() failure.
>
> You are right, my explanation was bad.
>
> It happens during the mtk-pm-domains driver probe.
>
> Not all domains can power up, then everything is rolled back. and this happens
> multiple times
> On rare occasions, it happens that another probing sequence starts while the
> previous one was not finished .
> I mentioned devices because I had in mind the fact that each device requires a
> power domain, and parallel probing of these devices causes a call to mtk-pm-domains
> driver probe to be called from two different places.
>
> e.g. device 1 probes -> call mtk-pm-domains probe because it requires X power domain
>
> device 2 probes -> call mtk-pm-domains probe because it requires Y power domain.
>
> First attempt fails but not completed while second attempt starts.
>
> Maybe this is a better explanation of the situation ?
Yeah, now it's a bit clearer!
At this point, I think that you can get away with locking just one path (or two):
/* This is the one giving me lots of suspects */
static void scpsys_remove_one_domain(struct scpsys_domain *pd)
{
int ret;
***lock***
if (scpsys_domain_is_on(pd))
scpsys_power_off(&pd->genpd);
***unlock***
.....
}
/* This one as well eventually */
static struct
generic_pm_domain *scpsys_add_one_domain(struct scpsys *scpsys, struct device_node
*node)
{
...............
if (MTK_SCPD_CAPS(pd, MTK_SCPD_KEEP_DEFAULT_OFF)) {
if (scpsys_domain_is_on(pd)) /* Maybe LOCK this one too? */
dev_warn(scpsys->dev,
"%pOF: A default off power domain has been ON\n", node);
} else {
** *** lock ***
ret = scpsys_power_on(&pd->genpd);
** *** unlock ***
if (ret < 0) {
dev_err(scpsys->dev, "%pOF: failed to power on domain: %d\n", node, ret);
goto err_put_subsys_clocks;
}
if (MTK_SCPD_CAPS(pd, MTK_SCPD_ALWAYS_ON))
pd->genpd.flags |= GENPD_FLAG_ALWAYS_ON;
}
..........
}
Can you please try locking only the remove_one_domain() poweroff call before
trying both?
Reason is that in the add_one_domain() case, we haven't registered the power
domain yet, so locking may not be required there to make things ticking right.
Cheers!
>>
>>>>
>>>>> ---
>>>>> drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 +++++++++++++++++-----
>>>>> 1 file changed, 19 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c
>>>>> b/drivers/pmdomain/mediatek/mtk-pm-domains.c
>>>>> index d5f0ee05c794..4f136b47e539 100644
>>>>> --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c
>>>>> +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c
>>>>> @@ -9,6 +9,7 @@
>>>>> #include <linux/io.h>
>>>>> #include <linux/iopoll.h>
>>>>> #include <linux/mfd/syscon.h>
>>>>> +#include <linux/mutex.h>
>>>>> #include <linux/of.h>
>>>>> #include <linux/of_clk.h>
>>>>> #include <linux/platform_device.h>
>>>>> @@ -56,6 +57,7 @@ struct scpsys {
>>>>> struct device *dev;
>>>>> struct regmap *base;
>>>>> const struct scpsys_soc_data *soc_data;
>>>>> + struct mutex mutex;
>>>>> struct genpd_onecell_data pd_data;
>>>>> struct generic_pm_domain *domains[];
>>>>> };
>>>>> @@ -238,9 +240,13 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
>>>>> bool tmp;
>>>>> int ret;
>>>>> + mutex_lock(&scpsys->mutex);
>>>>> +
>>>>> ret = scpsys_regulator_enable(pd->supply);
>>>>> - if (ret)
>>>>> + if (ret) {
>>>>> + mutex_unlock(&scpsys->mutex);
>>>>> return ret;
>>>>> + }
>>>>> ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks);
>>>>> if (ret)
>>>>> @@ -291,6 +297,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
>>>>> goto err_enable_bus_protect;
>>>>> }
>>>>> + mutex_unlock(&scpsys->mutex);
>>>>> return 0;
>>>>> err_enable_bus_protect:
>>>>> @@ -305,6 +312,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
>>>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks);
>>>>> err_reg:
>>>>> scpsys_regulator_disable(pd->supply);
>>>>> + mutex_unlock(&scpsys->mutex);
>>>>> return ret;
>>>>> }
>>>>> @@ -315,13 +323,15 @@ static int scpsys_power_off(struct generic_pm_domain
>>>>> *genpd)
>>>>> bool tmp;
>>>>> int ret;
>>>>> + mutex_lock(&scpsys->mutex);
>>>>> +
>>>>> ret = scpsys_bus_protect_enable(pd);
>>>>> if (ret < 0)
>>>>> - return ret;
>>>>> + goto err_mutex_unlock;
>>>>> ret = scpsys_sram_disable(pd);
>>>>> if (ret < 0)
>>>>> - return ret;
>>>>> + goto err_mutex_unlock;
>>>>> if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd,
>>>>> MTK_SCPD_EXT_BUCK_ISO))
>>>>> regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs,
>>>>> @@ -340,13 +350,15 @@ static int scpsys_power_off(struct generic_pm_domain
>>>>> *genpd)
>>>>> ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp,
>>>>> MTK_POLL_DELAY_US,
>>>>> MTK_POLL_TIMEOUT);
>>>>> if (ret < 0)
>>>>> - return ret;
>>>>> + goto err_mutex_unlock;
>>>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks);
>>>>> scpsys_regulator_disable(pd->supply);
>>>>> - return 0;
>>>>> +err_mutex_unlock:
>>>>> + mutex_unlock(&scpsys->mutex);
>>>>> + return ret;
>>>>> }
>>>>> static struct
>>>>> @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device *pdev)
>>>>> return PTR_ERR(scpsys->base);
>>>>> }
>>>>> + mutex_init(&scpsys->mutex);
>>>>> +
>>>>> ret = -ENODEV;
>>>>> for_each_available_child_of_node(np, node) {
>>>>> struct generic_pm_domain *domain;
>>>>
>>
More information about the linux-arm-kernel
mailing list