rcu_sched stall while waiting in csd_lock_wait()

Wed Aug 2 20:55:40 PDT 2017

Hi Will,

On Wednesday 02 August 2017 01:38 PM, Will Deacon wrote:
> Hi Pratyush,
> 
> On Wed, Aug 02, 2017 at 09:01:19AM +0530, Pratyush Anand wrote:
>> I am observing following rcu_sched stall while executing `perf record -a  --
>> sleep 1` with one of the arm64 platform. It looks like that stalled cpu was
>> waiting in csd_lock_wait() from where it never came out,and so the stall.
>> Any help/pointer for further debugging would be very helpful. Problem also
>> reproduced with 4.13.0-rc3.
> When you say "also", which other kernel(s) show the problem? Is this a
> recent regression? Which platform are you running on?

Other than 4.13.0-rc3 I had tested with 4.11 based kernel which has 4.11 
vanila+some 4.12 patches back ported, and log which I had attached was from 
same kernel, thats why I mentioned that it reproduced with vanilla upstream as 
well.

> 
> It would be interesting to know what the other CPUs are doing, in particular
> the target of the cross-call. Either it crashed spectacularly and didn't
> unlock the csd lock, or the IPI somehow wasn't delivered.
> 
> Do you see any other splats if you enable lock debugging?

It was same.

Following is the log from 4.13.0-rc3 + patch pointed by Peter:

[  173.649589] perf: interrupt took too long (4952 > 4902), lowering 
kernel.perf_event_max_sample_rate to 40300
[  201.340926] INFO: rcu_sched self-detected stall on CPU
[  201.345115] 	9-...: (6499 ticks this GP) idle=e1a/140000000000001/0 
softirq=334/334 fqs=3250
[  201.353617] 	 (t=6500 jiffies g=313 c=312 q=428)
[  201.358220] Task dump for CPU 9:
[  201.361431] perf            R  running task        0  1888   1864 0x00000202
[  201.368462] Call trace:
[  201.370897] [<ffff000008089498>] dump_backtrace+0x0/0x28c
[  201.376276] [<ffff000008089748>] show_stack+0x24/0x2c
[  201.381312] [<ffff00000810c5d0>] sched_show_task+0x19c/0x26c
[  201.386952] [<ffff00000810df28>] dump_cpu_task+0x48/0x54
[  201.392250] [<ffff00000816c564>] rcu_dump_cpu_stacks+0xac/0xf4
[  201.398063] [<ffff00000816b714>] rcu_check_callbacks+0x908/0xc90
[  201.404053] [<ffff000008172fe0>] update_process_times+0x34/0x5c
[  201.409957] [<ffff00000818697c>] tick_sched_handle.isra.16+0x4c/0x70
[  201.416292] [<ffff0000081869e8>] tick_sched_timer+0x48/0x88
[  201.421847] [<ffff000008173c60>] __hrtimer_run_queues+0x17c/0x604
[  201.427924] [<ffff000008174b90>] hrtimer_interrupt+0xa4/0x1e8
[  201.433656] [<ffff0000086d9d34>] arch_timer_handler_phys+0x3c/0x48
[  201.439818] [<ffff00000815b718>] handle_percpu_devid_irq+0xdc/0x42c
[  201.446069] [<ffff00000815569c>] generic_handle_irq+0x34/0x4c
[  201.451796] [<ffff000008155dec>] __handle_domain_irq+0x6c/0xc4
[  201.457611] [<ffff0000080816c4>] gic_handle_irq+0xa0/0x1b0
[  201.463080] Exception stack(0xffff8016df013a40 to 0xffff8016df013b70)
[  201.469504] 3a40: 0000000000000000 0000000000000003 0000000000000000 
ffff8016df013bd0
[  201.477316] 3a60: ffff8016df013bd0 0000000000080000 ffff8016df013bb8 
ffff0000082113c8
[  201.485129] 3a80: 0000000000000000 0000ffffe507a9f0 00000000ffffffff 
0000000000000000
[  201.492941] 3aa0: 0000000000000005 ffffffffffffffff 002f547d23157399 
00003a2a9f82ac9c
[  201.500754] 3ac0: 0000000000000000 0000000000000000 0000ffffe507a7e0 
ffff000008f5b000
[  201.508566] 3ae0: ffff8016df013c08 ffff000008213fcc 0000000000000013 
ffff8017616a7800
[  201.516379] 3b00: ffff000008f5b000 ffff0000082179d4 0000000000000000 
ffff0000088c1000
[  201.524191] 3b20: ffff8017616a7800 ffff8016df013b70 ffff00000818bf04 
ffff8016df013b70
[  201.532004] 3b40: ffff00000818bf28 0000000020400145 ffff000008213fcc 
0000000000000013
[  201.539816] 3b60: 0001000000000000 ffff8016df013bb8
[  201.544677] [<ffff000008082eb8>] el1_irq+0xb8/0x140
[  201.549539] [<ffff00000818bf28>] smp_call_function_single+0x160/0x184
[  201.555965] [<ffff000008211128>] cpu_function_call+0x40/0x64
[  201.561605] [<ffff0000082113a0>] event_function_call+0x120/0x128
[  201.567594] [<ffff0000082113ec>] _perf_event_disable+0x44/0x64
[  201.573410] [<ffff000008210afc>] perf_event_for_each_child+0x3c/0x84
[  201.579747] [<ffff00000821bcfc>] perf_ioctl+0x21c/0x9a4
[  201.584957] [<ffff0000082eec6c>] do_vfs_ioctl+0xcc/0x874
[  201.590250] [<ffff0000082ef4a4>] sys_ioctl+0x90/0xa4
[  201.595198] [<ffff0000080834a0>] __sys_trace_return+0x0/0x4
[  239.003035] INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 
9-... } 6592 jiffies s: 1149 root: 0x1/.
[  239.012199] blocking rcu_node structures: l=1:0-14:0x200/.
[  239.017695] Task dump for CPU 9:
[  239.020880] perf            R  running task        0  1888   1864 0x00000202
[  239.027929] Call trace:
[  239.030346] [<ffff000008085c9c>] __switch_to+0x64/0x70
[  239.035484] [<ffff000008235a00>] free_pcppages_bulk+0x43c/0x640
[  262.304244] perf: interrupt took too long (6221 > 6190), lowering 
kernel.perf_event_max_sample_rate to 32100
[  367.009704] INFO: task kworker/15:2:1187 blocked for more than 120 seconds.
[  367.015713]       Tainted: G        W       4.13.0-rc3+ #2
[  367.021200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[  367.028994] kworker/15:2    D    0  1187      2 0x00000000
[  367.034481] Workqueue: usb_hub_wq hub_event
[  367.038629] Call trace:
[  367.041077] [<ffff000008085c9c>] __switch_to+0x64/0x70
[  367.046184] [<ffff0000088a3680>] __schedule+0x410/0xcbc
[  367.051403] [<ffff0000088a3f6c>] schedule+0x40/0xa4
[  367.056253] [<ffff00000816a578>] 
_synchronize_rcu_expedited.constprop.64+0x370/0x3e8
[  367.063990] [<ffff00000816a66c>] synchronize_sched_expedited+0x7c/0xf0
[  367.070501] [<ffff00000816aa0c>] synchronize_rcu_expedited+0x14/0x1c
[  367.076825] [<ffff00000825ec24>] bdi_unregister+0x78/0x200
[  367.082307] [<ffff00000841fbac>] del_gendisk+0x130/0x240
[  367.087592] [<ffff000003de0688>] sr_remove+0x30/0x90 [sr_mod]
[  367.093333] [<ffff00000859901c>] device_release_driver_internal+0x15c/0x1dc
[  367.100275] [<ffff0000085990c4>] device_release_driver+0x28/0x34
[  367.106251] [<ffff00000859794c>] bus_remove_device+0xe4/0x14c
[  367.111993] [<ffff000008593890>] device_del+0x1dc/0x308
[  367.117191] [<ffff0000085e121c>] __scsi_remove_device+0xfc/0x128
[  367.123192] [<ffff0000085de894>] scsi_forget_host+0x74/0x78
[  367.128778] [<ffff0000085d0cf8>] scsi_remove_host+0x7c/0x120
[  367.134396] [<ffff000000ad3720>] usb_stor_disconnect+0x58/0xc8 [usb_storage]
[  367.141421] [<ffff000008636fd8>] usb_unbind_interface+0x7c/0x268
[  367.147397] [<ffff00000859901c>] device_release_driver_internal+0x15c/0x1dc
[  367.154355] [<ffff0000085990c4>] device_release_driver+0x28/0x34
[  367.160344] [<ffff00000859794c>] bus_remove_device+0xe4/0x14c
[  367.166060] [<ffff000008593890>] device_del+0x1dc/0x308
[  367.171282] [<ffff0000086344fc>] usb_disable_device+0xc4/0x2ac
[  367.177084] [<ffff000008629de4>] usb_disconnect+0x98/0x248
[  367.182566] [<ffff000008629ff8>] hub_quiesce+0x64/0xac
[  367.187674] [<ffff00000862bfc0>] hub_event+0x110/0xa8c
[  367.192810] [<ffff0000080f2a64>] process_one_work+0x228/0x6a8
[  367.198568] [<ffff0000080f2f44>] worker_thread+0x60/0x3bc
[  367.203922] [<ffff0000080faa40>] kthread+0x114/0x140
[  367.208898] [<ffff0000080833d0>] ret_from_fork+0x10/0x40
[  367.214171] INFO: task kworker/15:3:1890 blocked for more than 120 seconds.
[  367.221108]       Tainted: G        W       4.13.0-rc3+ #2
[  367.226563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[  367.234388] kworker/15:3    D    0  1890      2 0x00000200
[  367.239918] Workqueue: events wait_rcu_exp_gp
[  367.244184] Call trace:
[  367.246615] [<ffff000008085c9c>] __switch_to+0x64/0x70
[  367.251751] [<ffff0000088a3680>] __schedule+0x410/0xcbc
[  367.256945] [<ffff0000088a3f6c>] schedule+0x40/0xa4
[  367.261820] [<ffff0000088a936c>] schedule_timeout+0x1e8/0x454
[  367.267536] [<ffff000008165da0>] rcu_exp_wait_wake+0xb8/0x5b0
[  367.273278] [<ffff00000816a1f8>] wait_rcu_exp_gp+0x3c/0x4c
[  367.278776] [<ffff0000080f2a64>] process_one_work+0x228/0x6a8
[  367.284475] [<ffff0000080f2f44>] worker_thread+0x60/0x3bc
[  367.289931] [<ffff0000080faa40>] kthread+0x114/0x140
[  367.294793] [<ffff0000080833d0>] ret_from_fork+0x10/0x40
[  367.300100] INFO: lockdep is turned off.

-- 
Regards
Pratyush