[PATCH v7 00/11] Stop monitoring disabled devices

Daniel Lezcano daniel.lezcano at linaro.org
Thu Jul 2 13:49:51 EDT 2020


On 02/07/2020 19:19, Andrzej Pietrasiewicz wrote:
> Hi,
> 
> On 02.07.2020 at 19:01, Daniel Lezcano wrote:
>> On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:
>>> Hi Daniel,
>>>
>>> <snip>
>>>
>>>>>>>>
>>>>>>>> I did reproduce:
>>>>>>>>
>>>>>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>>>>>> v5.8-rc3 => imx6 boots correctly
>>>>
>>>> I finally succeeded in reproducing it locally on my imx7. The sensor
>>>> was failing to initialize for another reason, related to the legacy
>>>> cooling device, which is why the hang was not appearing on the imx7.
>>>>
>>>> I can now git-bisect :)
>>>>
>>>
>>> That would be very kind of you, thank you!
>>
>> With the lock correctness option enabled:
>>
>> [    4.179223] imx_thermal tempmon: Extended Commercial CPU temperature
>> grade - max:105C critical:100C passive:95C
>> [    4.189557]
>> [    4.191060] ============================================
>> [    4.196378] WARNING: possible recursive locking detected
>> [    4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted
>> [    4.207102] --------------------------------------------
>> [    4.212421] kworker/0:3/54 is trying to acquire lock:
>> [    4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
>> thermal_zone_device_is_enabled+0x18/0x34
>> [    4.225777]
>> [    4.225777] but task is already holding lock:
>> [    4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
>> thermal_zone_get_temp+0x38/0x6c
>> [    4.239121]
>> [    4.239121] other info that might help us debug this:
>> [    4.245655]  Possible unsafe locking scenario:
>> [    4.245655]
>> [    4.251579]        CPU0
>> [    4.254031]        ----
>> [    4.256481]   lock(&tz->lock);
>> [    4.259544]   lock(&tz->lock);
>> [    4.262608]
>> [    4.262608]  *** DEADLOCK ***
>> [    4.262608]
>> [    4.268533]  May be due to missing lock nesting notation
>> [    4.268533]
>> [    4.275329] 4 locks held by kworker/0:3/54:
>> [    4.279517]  #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at:
>> process_one_work+0x224/0x808
>> [    4.288241]  #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at:
>> process_one_work+0x224/0x808
>> [    4.296787]  #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:
>> __device_attach+0x30/0x140
>> [    4.304468]  #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
>> thermal_zone_get_temp+0x38/0x6c
>> [    4.312408]
>> [    4.312408] stack backtrace:
>> [    4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted
>> 5.8.0-rc3-00011-gf5e50bf4d3ef #42
>> [    4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)
>> [    4.330809] Workqueue: events deferred_probe_work_func
>> [    4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]
>> (show_stack+0x10/0x14)
>> [    4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]
>> (dump_stack+0xe8/0x114)
>> [    4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]
>> (__lock_acquire+0xbfc/0x2cb4)
>> [    4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]
>> (lock_acquire+0xf4/0x4e4)
>> [    4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]
>> (__mutex_lock+0xb0/0xaa8)
>> [    4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]
>> (mutex_lock_nested+0x1c/0x24)
>> [    4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]
>> (thermal_zone_device_is_enabled+0x18/0x34)
>> [    4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from
>> [<c0d9da90>] (imx_get_temp+0x30/0x208)
>> [    4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]
>> (thermal_zone_get_temp+0x4c/0x6c)
>> [    4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>]
>> (thermal_zone_device_update+0x8c/0x258)
>> [    4.419310] [<c0d93df0>] (thermal_zone_device_update) from
>> [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)
>> [    4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from
>> [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)
>> [    4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]
>> (platform_drv_probe+0x48/0x98)
>> [    4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>]
>> (really_probe+0x218/0x348)
>> [    4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]
>> (driver_probe_device+0x5c/0xb4)
>> [    4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>]
>> (bus_for_each_drv+0x58/0xb8)
>> [    4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]
>> (__device_attach+0xd4/0x140)
>> [    4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]
>> (bus_probe_device+0x88/0x90)
>> [    4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]
>> (deferred_probe_work_func+0x68/0x98)
>> [    4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>]
>> (process_one_work+0x2e0/0x808)
>> [    4.507237] [<c0369988>] (process_one_work) from [<c036a150>]
>> (worker_thread+0x2a0/0x59c)
>> [    4.515432] [<c036a150>] (worker_thread) from [<c0372208>]
>> (kthread+0x16c/0x178)
>> [    4.522843] [<c0372208>] (kthread) from [<c0300174>]
>> (ret_from_fork+0x14/0x20)
>> [    4.530074] Exception stack(0xca075fb0 to 0xca075ff8)
>> [    4.535138] 5fa0:                                     00000000
>> 00000000 00000000 00000000
>> [    4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000
>> 00000000 00000000 00000000
>> [    4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013
>> 00000000
>>
>>
>>
> 
> Thanks!
> 
> That confirms your suspicions.
> 
> So the reason is that ->get_temp() is called while the mutex is held and
> thermal_zone_device_is_enabled() wants to take the same mutex.

Yes, that's correct.
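
For illustration, here is a minimal userspace sketch of that pattern. It
is not the kernel code: struct zone, zone_get_temp(), zone_is_enabled()
and driver_get_temp() are hypothetical stand-ins for the thermal core and
the imx callback, and an error-checking pthread mutex is assumed so that
the second lock attempt returns EDEADLK instead of hanging the way the
kernel mutex does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct thermal_zone_device. */
struct zone {
	pthread_mutex_t lock;
	bool enabled;
	int (*get_temp)(struct zone *z, int *temp);
};

/* Use an error-checking mutex so a relock is reported, not a hang. */
static int zone_init(struct zone *z, int (*get_temp)(struct zone *, int *))
{
	pthread_mutexattr_t attr;
	int ret = pthread_mutexattr_init(&attr);

	if (ret)
		return ret;
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&z->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	z->enabled = true;
	z->get_temp = get_temp;
	return ret;
}

/* Models thermal_zone_device_is_enabled(): takes the zone lock itself. */
static int zone_is_enabled(struct zone *z, bool *enabled)
{
	int ret = pthread_mutex_lock(&z->lock);

	if (ret)
		return ret;	/* EDEADLK when the caller already holds it */
	*enabled = z->enabled;
	pthread_mutex_unlock(&z->lock);
	return 0;
}

/* Models thermal_zone_get_temp(): runs the callback with the lock held. */
static int zone_get_temp(struct zone *z, int *temp)
{
	int ret = pthread_mutex_lock(&z->lock);

	if (ret)
		return ret;
	ret = z->get_temp(z, temp);
	pthread_mutex_unlock(&z->lock);
	return ret;
}

/* Models the problematic callback: queries the zone mode under the lock. */
static int driver_get_temp(struct zone *z, int *temp)
{
	bool enabled;
	int ret = zone_is_enabled(z, &enabled);

	if (ret)
		return ret;
	if (!enabled)
		return EAGAIN;
	*temp = 42000;	/* made-up reading, millidegrees Celsius */
	return 0;
}

/* Returns the error seen when the callback tries to retake the lock. */
static int recursive_lock_demo(void)
{
	struct zone z;
	int temp = 0, ret;

	ret = zone_init(&z, driver_get_temp);
	assert(ret == 0);
	ret = zone_get_temp(&z, &temp);
	pthread_mutex_destroy(&z.lock);
	return ret;
}
```

Calling recursive_lock_demo() from a main() shows the nested acquisition
being rejected; with a default (non error-checking) mutex the same call
chain would simply block forever, which matches the hang at boot.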

> Is adding a comment to thermal_zone_device_is_enabled() to never call
> it while the mutex is held and adding another version of it which does
> not take the mutex ok?

thermal_zone_device_is_enabled() is only used in two places, acpi and
this imx driver, and given that:

1. as soon as the mutex is released, there is no guarantee the thermal
zone mode won't change right after, so the lock is pointless and so is
the information it protects;

2. from a design point of view, I don't see why a driver should know
whether a thermal zone is disabled or not,

it would make sense to get rid of this function and not give the
drivers an opportunity to access this information.

Why not add change_mode for acpi in order to enable or disable the
events, and for imx_thermal use the irq_enabled flag instead of the
thermal zone mode? Moreover, it is very unclear why this function is
needed in imx_get_temp(), and I suspect we should be able to get rid of it.
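
As a rough userspace sketch of the irq_enabled idea (hypothetical names
throughout, not the actual imx_thermal code): the callback below only
consults state the driver owns, so it never needs the zone lock that the
core already holds while calling it. The alarm threshold and the sensor
value are made-up numbers, and setting the flag stands in for what would
be an enable_irq() call in a real driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical driver-private state, loosely modelled on a thermal driver. */
struct drv_data {
	bool irq_enabled;	/* driver's own record of the alarm IRQ state */
	int alarm_temp;		/* millidegrees Celsius */
	int raw_temp;		/* stands in for a sensor register read */
};

/* Hypothetical stand-in for struct thermal_zone_device. */
struct zone {
	pthread_mutex_t lock;
	int (*get_temp)(struct zone *z, int *temp);
	void *devdata;
};

/* Core side: invokes the driver callback with the zone lock held. */
static int zone_get_temp(struct zone *z, int *temp)
{
	int ret = pthread_mutex_lock(&z->lock);

	if (ret)
		return ret;
	ret = z->get_temp(z, temp);
	pthread_mutex_unlock(&z->lock);
	return ret;
}

/* Driver side: looks only at its own flag, never at the zone mode. */
static int driver_get_temp(struct zone *z, int *temp)
{
	struct drv_data *data = z->devdata;

	*temp = data->raw_temp;
	/* Re-arm the alarm once the temperature is back under the threshold. */
	if (!data->irq_enabled && *temp < data->alarm_temp)
		data->irq_enabled = true;	/* a real driver would enable the IRQ here */
	return 0;
}

/* Returns 0 when the callback ran under the lock without retaking it. */
static int private_flag_demo(void)
{
	struct drv_data data = {
		.irq_enabled = false,
		.alarm_temp = 95000,
		.raw_temp = 40000,
	};
	struct zone z = { .get_temp = driver_get_temp, .devdata = &data };
	int temp = 0, ret;

	ret = pthread_mutex_init(&z.lock, NULL);
	assert(ret == 0);
	ret = zone_get_temp(&z, &temp);
	assert(temp == 40000 && data.irq_enabled);
	pthread_mutex_destroy(&z.lock);
	return ret;
}
```

The point of the design is that the driver's decision depends only on
data it owns, so the core is free to hold its lock across ->get_temp().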


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog



More information about the linux-arm-kernel mailing list