Random stack corruption on v5.13 with dra76
Tomi Valkeinen
tomi.valkeinen at ideasonboard.com
Fri May 21 05:57:43 PDT 2021
On 21/05/2021 13:30, Tomi Valkeinen wrote:
> On 21/05/2021 12:14, Tony Lindgren wrote:
>> * Tomi Valkeinen <tomi.valkeinen at ideasonboard.com> [210521 08:45]:
>>> On 21/05/2021 10:39, Tony Lindgren wrote:
>>>> * Tomi Valkeinen <tomi.valkeinen at ideasonboard.com> [210521 07:05]:
>>>>> On 21/05/2021 08:36, Tony Lindgren wrote:
>>>>>> * Tomi Valkeinen <tomi.valkeinen at ideasonboard.com> [210520 08:27]:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've noticed that the v5.13 rcs crash randomly (but quite often)
>>>>>>> on dra76 evm
>>>>>>> (I haven't tested other boards). Anyone else seen this problem?
>>>>>>
>>>>>> I have not seen this so far and beagle-x15 is behaving for me.
>>>>>>
>>>>>> Does it always happen on boot?
>>>>>
>>>>> No, but quite often. I can't really say how often, as it's
>>>>> annoyingly random.
>>>>> I tried to bisect, but that proved to be difficult as sometimes I
>>>>> get multiple (5+)
>>>>> successful boots before the crash.
>>>>>
>>>>> I tested with x15, same issue (below). So... Something in my kernel
>>>>> config? Or compiler?
>>>>> Looks like the crash happens always very soon after (or during)
>>>>> probing palmas.
>>>>
>>>> After about 10 reboots with your .config I'm seeing it now too on
>>>> beagle-x15. So far no luck reproducing it with omap2plus_defconfig.
>>>
>>> I think I have an easy way to see if a kernel is good or bad, by
>>> printing
>>> stack_not_used(current) in the first call to omap_i2c_xfer_irq().
>>> There's a
>>> huge drop between v5.12 and v5.13-rc1.
>>>
>>> And interestingly, sometimes a simple printk seems to use hundreds of
>>> bytes
>>> of stack (i.e. compare stack usage before and after the print). But not
>>> always. So maybe the issue is somehow related to printk.
>>>
>>> I'm bisecting.
>>
>> OK sounds good to me.
>
> Well, I found the bad commit but unfortunately it doesn't exactly point
> where the issue is.
>
> f483a3e123410bd1c78af295bf65feffb6769a98 is the first bad commit
> commit f483a3e123410bd1c78af295bf65feffb6769a98
> Author: Tony Lindgren <tony at atomide.com>
> Date: Wed Mar 10 14:03:48 2021 +0200
>
> ARM: dts: Configure simple-pm-bus for dra7 l4_per1
>
> We can now probe interconnects with device tree only configuration
> using
> simple-pm-bus and genpd.
>
> Tested-by: Kishon Vijay Abraham I <kishon at ti.com>
> Signed-off-by: Tony Lindgren <tony at atomide.com>
>
> arch/arm/boot/dts/dra7-l4.dtsi | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
>
> The difference is clear, though. With
> 9a75368b6426739e8b798592f084cb682d760568, which is the last good commit,
> when I print the stack usage with stack_not_used() in three different
> places in omap_i2c_xfer_irq(), I get always prints roughly like:
>
> STACK FREE omap_i2c_xfer_irq: 2972, 2972, 2972
>
> And these repeat exactly the same for each call to omap_i2c_xfer_irq (at
> least during palmas probe).
>
> With the bad commit the situation is different. The first call to
> omap_i2c_xfer_irq prints:
>
> STACK FREE omap_i2c_xfer_irq: 2024, 2024, 2024
>
> so we're already using 1k more. But then, instead of the stack usage
> staying the same, consecutive calls show increased stack usage. It
> doesn't increase for each xfer call, but after about 10 calls, I'm
> getting ~1800, ten calls more I see ~800, and going down to ~500.
>
> However, with this bad commit, I don't see the empty stack going below
> ~500, so I don't get crashes. But going to a more recent commit, like
> 01d7136894410a71932096e0fb9f1d301b6ccf07, the situation is much worse.
> The first print shows:
>
> STACK FREE omap_i2c_xfer_irq: 1164, 1164, 1164
>
> and it quickly goes to stack overflow.
Well... I guess there's no bug as such. The simple-pm-bus just uses a
lot more stack. Looking at the reports from CONFIG_STACK_TRACER, in the
good commit the biggest stack is in USB probing with depth of 4896. With
01d7136894410a71932096e0fb9f1d301b6ccf07, and a debug print in
omap_i2c_xfer_common, I can get to 7544.
I don't know what to do about this, so I'm giving up for the time being
and "fix" it for myself by increasing THREAD_SIZE_ORDER.
# cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (143 entries)
----- ---- --------
0) 7544 28 ftrace_graph_call+0x0/0xc
1) 7516 28 rcu_read_lock_sched_held+0x38/0x88
2) 7488 120 lock_acquire+0x1ac/0x560
3) 7368 16 _raw_spin_lock_irqsave+0x64/0x80
4) 7352 56 timekeeping_advance+0x3c/0x934
5) 7296 144 update_wall_time+0x1c/0x20
6) 7152 16 tick_do_update_jiffies64+0x1f0/0x304
7) 7136 104 tick_sched_timer+0xf0/0x10c
8) 7032 96 __hrtimer_run_queues+0x2d8/0x68c
9) 6936 112 hrtimer_interrupt+0x158/0x388
10) 6824 16 dmtimer_clockevent_interrupt+0x34/0x3c
11) 6808 64 __handle_irq_event_percpu+0x108/0x384
12) 6744 40 handle_irq_event+0x68/0xe0
13) 6704 32 handle_fasteoi_irq+0xbc/0x224
14) 6672 40 __handle_domain_irq+0x98/0x10c
15) 6632 48 gic_handle_irq+0x9c/0xc4
16) 6584 68 __irq_svc+0x70/0x98
17) 6516 60 console_unlock+0x444/0x688
18) 6456 196 vprintk_emit+0x138/0x344
19) 6260 68 vprintk_default+0x34/0x3c
20) 6192 48 vprintk_func+0xec/0x194
21) 6144 24 printk+0x3c/0x64
22) 6120 104 omap_i2c_xfer_common+0x698/0x6e4
23) 6016 16 omap_i2c_xfer_irq+0x1c/0x20
24) 6000 80 __i2c_transfer+0x184/0xa80
25) 5920 40 i2c_transfer+0xa8/0x144
26) 5880 64 regmap_i2c_read+0x6c/0xa8
27) 5816 72 _regmap_raw_read+0x138/0x474
28) 5744 40 _regmap_bus_read+0x54/0x80
29) 5704 56 _regmap_read+0x74/0x248
30) 5648 32 regmap_read+0x4c/0x70
31) 5616 32 regulator_get_voltage_sel_regmap+0x40/0x94
32) 5584 32 regulator_attr_is_visible+0x18c/0x278
33) 5552 88 internal_create_group+0xe4/0x3e0
34) 5464 40 internal_create_groups.part.0+0x54/0x98
35) 5424 16 sysfs_create_groups+0x24/0x30
36) 5408 96 device_add+0x280/0x944
37) 5312 104 regulator_register+0x694/0xc4c
38) 5208 32 devm_regulator_register+0x48/0x84
39) 5176 72 palmas_smps_registration+0x1b4/0x480
40) 5104 136 palmas_regulators_probe+0x348/0x450
41) 4968 32 platform_probe+0x68/0xc8
42) 4936 64 really_probe+0x114/0x588
43) 4872 48 driver_probe_device+0x88/0x1f0
44) 4824 32 __device_attach_driver+0xac/0x118
45) 4792 48 bus_for_each_drv+0x90/0xe0
46) 4744 48 __device_attach+0xe0/0x1ec
47) 4696 16 device_initial_probe+0x1c/0x20
48) 4680 32 bus_probe_device+0x98/0xa0
49) 4648 96 device_add+0x37c/0x944
50) 4552 16 of_device_add+0x44/0x4c
51) 4536 40 of_platform_device_create_pdata+0xa0/0xcc
52) 4496 112 of_platform_bus_create+0x1bc/0x34c
53) 4384 48 of_platform_populate+0x90/0x124
54) 4336 32 devm_of_platform_populate+0x4c/0x90
55) 4304 88 palmas_i2c_probe+0x478/0x608
56) 4216 40 i2c_device_probe+0x2a0/0x2f8
57) 4176 64 really_probe+0x114/0x588
58) 4112 48 driver_probe_device+0x88/0x1f0
59) 4064 32 __device_attach_driver+0xac/0x118
60) 4032 48 bus_for_each_drv+0x90/0xe0
61) 3984 48 __device_attach+0xe0/0x1ec
62) 3936 16 device_initial_probe+0x1c/0x20
63) 3920 32 bus_probe_device+0x98/0xa0
64) 3888 96 device_add+0x37c/0x944
65) 3792 24 device_register+0x24/0x28
66) 3768 48 i2c_new_client_device+0x140/0x264
67) 3720 112 of_i2c_register_devices+0xdc/0x188
68) 3608 64 i2c_register_adapter+0x220/0x700
69) 3544 32 __i2c_add_numbered_adapter+0x60/0xac
70) 3512 32 i2c_add_adapter+0xa0/0xe4
71) 3480 16 i2c_add_numbered_adapter+0x2c/0x30
72) 3464 80 omap_i2c_probe+0x404/0x6b8
73) 3384 32 platform_probe+0x68/0xc8
74) 3352 64 really_probe+0x114/0x588
75) 3288 48 driver_probe_device+0x88/0x1f0
76) 3240 32 __device_attach_driver+0xac/0x118
77) 3208 48 bus_for_each_drv+0x90/0xe0
78) 3160 48 __device_attach+0xe0/0x1ec
79) 3112 16 device_initial_probe+0x1c/0x20
80) 3096 32 bus_probe_device+0x98/0xa0
81) 3064 96 device_add+0x37c/0x944
82) 2968 16 of_device_add+0x44/0x4c
83) 2952 40 of_platform_device_create_pdata+0xa0/0xcc
84) 2912 112 of_platform_bus_create+0x1bc/0x34c
85) 2800 48 of_platform_populate+0x90/0x124
86) 2752 104 sysc_probe+0x10e4/0x1628
87) 2648 32 platform_probe+0x68/0xc8
88) 2616 64 really_probe+0x114/0x588
89) 2552 48 driver_probe_device+0x88/0x1f0
90) 2504 32 __device_attach_driver+0xac/0x118
91) 2472 48 bus_for_each_drv+0x90/0xe0
92) 2424 48 __device_attach+0xe0/0x1ec
93) 2376 16 device_initial_probe+0x1c/0x20
94) 2360 32 bus_probe_device+0x98/0xa0
95) 2328 96 device_add+0x37c/0x944
96) 2232 16 of_device_add+0x44/0x4c
97) 2216 40 of_platform_device_create_pdata+0xa0/0xcc
98) 2176 112 of_platform_bus_create+0x1bc/0x34c
99) 2064 48 of_platform_populate+0x90/0x124
100) 2016 32 simple_pm_bus_probe+0x54/0x80
101) 1984 32 platform_probe+0x68/0xc8
102) 1952 64 really_probe+0x114/0x588
103) 1888 48 driver_probe_device+0x88/0x1f0
104) 1840 32 __device_attach_driver+0xac/0x118
105) 1808 48 bus_for_each_drv+0x90/0xe0
106) 1760 48 __device_attach+0xe0/0x1ec
107) 1712 16 device_initial_probe+0x1c/0x20
108) 1696 32 bus_probe_device+0x98/0xa0
109) 1664 96 device_add+0x37c/0x944
110) 1568 16 of_device_add+0x44/0x4c
111) 1552 40 of_platform_device_create_pdata+0xa0/0xcc
112) 1512 112 of_platform_bus_create+0x1bc/0x34c
113) 1400 48 of_platform_populate+0x90/0x124
114) 1352 32 simple_pm_bus_probe+0x54/0x80
115) 1320 32 platform_probe+0x68/0xc8
116) 1288 64 really_probe+0x114/0x588
117) 1224 48 driver_probe_device+0x88/0x1f0
118) 1176 32 __device_attach_driver+0xac/0x118
119) 1144 48 bus_for_each_drv+0x90/0xe0
120) 1096 48 __device_attach+0xe0/0x1ec
121) 1048 16 device_initial_probe+0x1c/0x20
122) 1032 32 bus_probe_device+0x98/0xa0
123) 1000 96 device_add+0x37c/0x944
124) 904 16 of_device_add+0x44/0x4c
125) 888 40 of_platform_device_create_pdata+0xa0/0xcc
126) 848 112 of_platform_bus_create+0x1bc/0x34c
127) 736 48 of_platform_populate+0x90/0x124
128) 688 32 simple_pm_bus_probe+0x54/0x80
129) 656 32 platform_probe+0x68/0xc8
130) 624 64 really_probe+0x114/0x588
131) 560 48 driver_probe_device+0x88/0x1f0
132) 512 32 __device_attach_driver+0xac/0x118
133) 480 48 bus_for_each_drv+0x90/0xe0
134) 432 48 __device_attach+0xe0/0x1ec
135) 384 16 device_initial_probe+0x1c/0x20
136) 368 32 bus_probe_device+0x98/0xa0
137) 336 48 deferred_probe_work_func+0xac/0xfc
138) 288 104 process_one_work+0x270/0x7c4
139) 184 48 worker_thread+0x1ec/0x524
140) 136 48 kthread+0x160/0x178
141) 88 8 ret_from_fork+0x14/0x38
142) 80 80 0x0
Tomi
More information about the linux-arm-kernel
mailing list