Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)

David Daney ddaney at caviumnetworks.com
Thu Dec 10 11:43:46 PST 2015


Hi,

We are getting soft lockup OOPs on Cavium CN88XX (A.K.A. ThunderX), 
which is an arm64 implementation.

A typical failure shows multiple threads stuck in mutex operations like 
this:

.
.
.
[   68.909873] Task dump for CPU 18:
[   68.909876] systemd-udevd   R  running task        0   537    534 0x00000002
[   68.909877] Call trace:
[   68.909880] [<fffffe0000088858>] dump_backtrace+0x0/0x17c
[   68.909883] [<fffffe00000889f8>] show_stack+0x24/0x2c
[   68.909885] [<fffffe00000c4210>] sched_show_task+0xb0/0x104
[   68.909888] [<fffffe00000c682c>] dump_cpu_task+0x48/0x54
[   68.909890] [<fffffe00000ee5e0>] rcu_dump_cpu_stacks+0x9c/0xec
[   68.909893] [<fffffe00000f2c9c>] rcu_check_callbacks+0x524/0xa18
[   68.909896] [<fffffe00000f83a0>] update_process_times+0x44/0x74
[   68.909899] [<fffffe00001078d4>] tick_sched_timer+0x78/0x1ac
[   68.909901] [<fffffe00000f8b74>] __hrtimer_run_queues+0x148/0x2d4
[   68.909903] [<fffffe00000f9464>] hrtimer_interrupt+0xb0/0x1f4
[   68.909906] [<fffffe000056e6e8>] arch_timer_handler_phys+0x3c/0x48
[   68.909909] [<fffffe00000e7fd4>] handle_percpu_devid_irq+0xb0/0x1b0
[   68.909912] [<fffffe00000e33c4>] generic_handle_irq+0x34/0x4c
[   68.909914] [<fffffe00000e3738>] __handle_domain_irq+0x90/0xfc
[   68.909916] [<fffffe0000081d80>] gic_handle_irq+0x90/0x18c
[   68.909918] Exception stack(0xfffffe03f14e3920 to 0xfffffe03f14e3a40)
[   68.909921] 3920: fffffe03fd5c5800 fffffe0000c55800 fffffe03f14e3a80 fffffe00000dabd8
[   68.909924] 3940: 00000000a0000145 0000000000000015 fffffe03e9602400 fffffe00002fddb0
[   68.909927] 3960: 0000000000000000 0000000000000000 fffffe03fd5c5810 fffffe03f14e0000
[   68.909929] 3980: 0000000000000001 ffffffffff000000 fffffe03db307e38 0000000000000000
[   68.909932] 39a0: 0000000000737973 00000000ffffffff 0000000000000000 000000003b364d50
[   68.909935] 39c0: 0000000000000018 ffffffffa99641af 0016fd71b6000000 003b9aca00000000
[   68.909937] 39e0: fffffe00001f1508 000003ff9b9fd028 000003ffed7a0a10 fffffe03fd5c5800
[   68.909940] 3a00: fffffe0000c55800 fffffe0000cea1c8 fffffe03fd5a5800 fffffe0000ca2eb0
[   68.909943] 3a20: 0000000000000015 fffffe03e9602400 fffffe0000cea1c8 fffffe0000712000
[   68.909945] [<fffffe0000084ce8>] el1_irq+0x68/0xd8
[   68.909948] [<fffffe00000da03c>] mutex_optimistic_spin+0x9c/0x1d0
[   68.909951] [<fffffe00006fe4b8>] __mutex_lock_slowpath+0x44/0x158
[   68.909953] [<fffffe00006fe620>] mutex_lock+0x54/0x58
[   68.909956] [<fffffe0000265efc>] kernfs_iop_permission+0x38/0x70
[   68.909959] [<fffffe00001fbf50>] __inode_permission+0x88/0xd8
[   68.909961] [<fffffe00001fbfd0>] inode_permission+0x30/0x6c
[   68.909964] [<fffffe00001fe26c>] link_path_walk+0x68/0x4d4
[   68.909966] [<fffffe00001ffa14>] path_openat+0xb4/0x2bc
[   68.909968] [<fffffe000020123c>] do_filp_open+0x74/0xd0
[   68.909971] [<fffffe00001f13e4>] do_sys_open+0x14c/0x228
[   68.909973] [<fffffe00001f1544>] SyS_openat+0x3c/0x48
[   68.909976] [<fffffe00000851f0>] el0_svc_naked+0x24/0x28
.
.
.

Reverting 81a43adae3b9 (locking/mutex: Use acquire/release semantics) 
makes the problem go away.

At this point it is unknown if this patch is incorrect, or if the 
underlying ARM64 atomic_*_{acquire,release} primitives are defective, or 
if the problem lies elsewhere.
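
For readers unfamiliar with the terminology, here is a rough user-space 
sketch of what acquire/release ordering on a lock word means.  It uses 
C11 <stdatomic.h>, not the kernel's atomic_*_{acquire,release} API, and 
the toy_mutex type and its functions are hypothetical names made up for 
illustration; this is not the code touched by the commit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock word: 1 = unlocked, 0 = locked. */
typedef struct {
	atomic_int count;
} toy_mutex;

static void toy_mutex_init(toy_mutex *m)
{
	atomic_init(&m->count, 1);
}

/*
 * Try to take the lock.  On success the compare-exchange has ACQUIRE
 * ordering: accesses inside the critical section cannot be reordered
 * before it.  On failure nothing was acquired, so relaxed is enough.
 */
static bool toy_mutex_trylock(toy_mutex *m)
{
	int expected = 1;

	return atomic_compare_exchange_strong_explicit(
		&m->count, &expected, 0,
		memory_order_acquire, memory_order_relaxed);
}

/*
 * Drop the lock.  The store has RELEASE ordering: accesses inside the
 * critical section cannot be reordered after it.
 */
static void toy_mutex_unlock(toy_mutex *m)
{
	atomic_store_explicit(&m->count, 1, memory_order_release);
}
```

Both orderings are weaker than a full barrier on each operation, so 
code that happened to depend on the stronger ordering could in 
principle misbehave on a weakly ordered CPU.  Whether anything like 
that is happening here is not yet known.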

I am not requesting any specific action with this e-mail, but wanted to 
draw attention to the issue.  Undoubtedly we will be able to provide 
more detailed information about the issue in the coming days.

Thanks,
David Daney



