mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging

Marek Szyprowski m.szyprowski at samsung.com
Tue Dec 12 03:38:52 PST 2017


Hi All,

On 2017-12-11 23:28, Javier Martinez Canillas wrote:
> [adding Marek and Shuah to cc list]
>
> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>> <guillaume.tucker at collabora.com> wrote:
>>> Hi Daniel,
>>>
>>> Please see below, I've had several bisection results pointing at
>>> that commit over the week-end on mainline but also on linux-next
>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>> and is likely to have more than one issue, it does seem like this
>>> commit is causing some well reproducible kernel hang.
>>>
>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>
>>>    https://lava.collabora.co.uk/scheduler/job/1018478
>>>
>>> and here's another one with the change mentioned below reverted:
>>>
>>>    https://lava.collabora.co.uk/scheduler/job/1018479
>>>
>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>> I don't know if this is related as I haven't investigated any
>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>> boot most of the time with the commit reverted though.
>>>
>>> The automated kernelci.org bisection is still an experimental
>>> tool and it may well be a false positive, so please take this
>>> result with a pinch of salt...
>> The patch just very minimal moves the connector cleanup around (so
>> timing change), but except when you unload a driver (or maybe that
>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>> more info than "seems to hang a bit more" I have no idea what's wrong.
>> The patch itself should work, at least it survived quite some serious
>> testing we do on everything.
>> -Daniel
>>
> Marek was pointing to a different culprit [0] in this [1] thread. I
> see that both commits made it to v4.15-rc3, which is the first version
> where boot fails. So maybe is a combination of both? Or rather
> reverting one patch masks the error in the other.
>
> I've access to the machine but unfortunately not a lot of time to dig
> on this, I could try to do it in the weekend though.

After a recent discussion on Javier's patch:
https://patchwork.kernel.org/patch/10106417/
I've managed to reproduce this issue on the Exynos5250-based Samsung
Snow Chromebook as well and investigated it a bit.

It is caused by a deadlock in the main kernel workqueue. Here are the
details:

1. Exynos DRM fails to initialize due to missing regulators and gets moved
to the deferred probe device list

2. Deferred probe is triggered and the kernel "events" workqueue calls
deferred_probe_work_func()

3. exynos_drm_bind() is called, component_bind_all() fails due to the
missing Exynos Mixer device

4. The error handling path is executed in exynos_drm_bind(), which calls
drm_mode_config_cleanup()

5. drm_mode_config_cleanup() calls flush_scheduled_work(), which causes a
deadlock: the flush waits for all work on the "events" workqueue,
including the deferred probe work item that is calling it.
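
The sequence above can be illustrated with a small userspace model. This
is only a sketch, not kernel code: struct model_wq and every model_*()
function are hypothetical names made up for the illustration, and the
model records the point where a real flush would block forever instead
of actually hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_WORK 8

struct model_wq;
typedef void (*model_work_fn)(struct model_wq *wq);

/* Hypothetical single-threaded queue; not a kernel data structure. */
struct model_wq {
    model_work_fn items[MODEL_MAX_WORK];
    size_t head, tail;       /* FIFO bounds of queued items */
    bool in_work_item;       /* true while an item's function executes */
    bool deadlock_detected;  /* set where a real flush would block forever */
};

void model_run_all(struct model_wq *wq);

bool model_queue_work(struct model_wq *wq, model_work_fn fn)
{
    if (wq->tail == MODEL_MAX_WORK)
        return false;
    wq->items[wq->tail++] = fn;
    return true;
}

/* Stand-in for a flush: it must wait for every queued or in-flight item. */
void model_flush(struct model_wq *wq)
{
    if (wq->in_work_item) {
        /* The caller is in flight and cannot finish until the flush
         * returns, while the flush cannot return until the caller
         * finishes. */
        wq->deadlock_detected = true;
        return;
    }
    model_run_all(wq);
}

/* Worker loop: runs queued items one at a time, in order. */
void model_run_all(struct model_wq *wq)
{
    assert(!wq->in_work_item);
    while (wq->head < wq->tail) {
        model_work_fn fn = wq->items[wq->head++];

        wq->in_work_item = true;
        fn(wq);
        wq->in_work_item = false;
    }
}

/* Mirrors steps 2-5: the probe item fails and its error path flushes. */
void model_deferred_probe(struct model_wq *wq)
{
    model_flush(wq);
}

/* Returns true if the modelled sequence reaches the self-wait. */
bool model_reproduce(void)
{
    struct model_wq wq = {0};

    model_queue_work(&wq, model_deferred_probe);
    model_run_all(&wq);
    return wq.deadlock_detected;
}
```

In this model a flush issued from outside any work item completes
normally; only the flush issued from inside model_deferred_probe() hits
the self-wait.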

Do you have any idea how to fix this issue properly?

git blame indeed shows that the issue was introduced by commit
a703c55004e1 ("drm: safely free connectors from connector_iter"), which
added a call to flush_scheduled_work() to drm_mode_config_cleanup().

drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
it is called from the workqueue, but I have no idea how to check that.
The other way of fixing it would be to resurrect a separate workqueue
for DRM-related events.
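
To make the second option concrete, here is a minimal userspace sketch
of why a separate queue avoids the self-wait. It is a model only: struct
sim_wq and the sim_*() functions are hypothetical names, not kernel
APIs, and it assumes a flush only has to wait for items on the queue it
is given.

```c
#include <stdbool.h>

/* Hypothetical model: a queue only records whether one of its items
 * is currently executing. */
struct sim_wq {
    bool item_running;
};

/* A flush must wait for the queue's in-flight item. Returns false
 * where a real flush would never return. */
bool sim_flush_returns(const struct sim_wq *wq)
{
    return !wq->item_running;
}

/* Error-path cleanup executed from a work item on caller_wq, flushing
 * cleanup_wq. Returns true if the flush can complete. */
bool sim_cleanup_from_work(struct sim_wq *caller_wq,
                           struct sim_wq *cleanup_wq)
{
    bool ok;

    caller_wq->item_running = true;   /* deferred probe item in flight */
    ok = sim_flush_returns(cleanup_wq);
    caller_wq->item_running = false;
    return ok;
}
```

Passing the same queue for both arguments models the current behaviour
and reports a flush that cannot return; passing two distinct queues
reports a flush that can. The same self-wait would come back if the
cleanup were ever run from a work item on the dedicated queue itself.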

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



