[BUG] CONFIG_UNINLINE_SPIN_UNLOCK important for Cortex-A9
Arnd Bergmann
arnd at arndb.de
Tue May 31 04:43:43 PDT 2016
On Tuesday, May 31, 2016 12:51:03 PM CEST Holger Schurig wrote:
> > Have you tried multi_v7_defconfig? That also does not set
> > CONFIG_UNINLINE_SPIN_UNLOCK, but it is generally assumed to work.
>
> Not yet, will try this (see below).
Ok
> > Just to be sure: If you just enable UNINLINE_SPIN_UNLOCK (e.g. using
> > a 'select' from arch/arm/mach-imx/Kconfig) without selecting PROVE_RCU,
> > you say the bug is gone too?
>
> Currently I select it indirectly. One person that tried to help me
> turned on 29 different kernel debug options and the bug was gone. Then I
> reduced this step by step to find out which of them was needed
> minimally.
>
> Still I don't think that it's an RCU issue. But it could of course be a
> "Heisenbug".
I would also not guess RCU as the first possibility. I was first guessing
that something might be wrong with arch_spin_unlock(), but it's actually
a really trivial function:
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
smp_mb();
lock->tickets.owner++;
dsb_sev();
}
so there is really not much that can go wrong, and if there is,
then presumably either smp_mb() or dsb_sev() is not as strong
a barrier here as it should be, and that seems very unlikely.
The last change to the function was a while ago, but you could
try reverting this commit if nothing else helps:
commit 20e260b6f4f717c100620122f626a2c06a4cfd72
Author: Will Deacon <will.deacon at arm.com>
Date: Thu Jan 24 14:47:38 2013 +0100
ARM: 7632/1: spinlock: avoid exclusive accesses on unlock() path
When unlocking a spinlock, all we need to do is increment the owner
field of the lock. Since only one CPU can be performing an unlock()
operation for a given lock, this doesn't need to be exclusive.
This patch simplifies arch_spin_unlock to use non-exclusive accesses
when updating the owner field of the lock.
Signed-off-by: Will Deacon <will.deacon at arm.com>
Signed-off-by: Russell King <rmk+kernel at arm.linux.org.uk>
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index b4ca707d0a69..6220e9fdf4c7 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -119,22 +119,8 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
- unsigned long tmp;
- u32 slock;
-
smp_mb();
-
- __asm__ __volatile__(
-" mov %1, #1\n"
-"1: ldrex %0, [%2]\n"
-" uadd16 %0, %0, %1\n"
-" strex %1, %0, [%2]\n"
-" teq %1, #0\n"
-" bne 1b"
- : "=&r" (slock), "=&r" (tmp)
- : "r" (&lock->slock)
- : "cc");
-
+ lock->tickets.owner++;
dsb_sev();
}
> > - Specific kernel version, and full list of patches applied (if any).
> > Have you reproduced this with a plain unpatched linux-4.6 kernel?
>
> I'm not in love with vendor-kernels, so I use kernels from kernel.org.
> When I started to chase the oops happened in 4.4.4, but the behavior was
> also present in 4.5, 4.5.1 ... 4.5.4 and 4.6.
>
> I have some patches applied, e.g. aufs4, one patch in the i.MX6Q FEC
> ethernet driver and stuff to program the eMMC firmware. Nothing that
> plays with rcu, spinlocks etc, e.g. no preempt or Linux-RT patches.
>
> However, a vanilla kernel boots on my device (isn't device tree
> nice?!?!?) and here I had the same behavior. I think that vanilla kernel
> was 4.5.3, but I'm unsure.
Ok, good, that should make it easy to reproduce.
>
> > - Specific gcc version you used for building the kernel. Does this
> > happen with both old and new compilers, e.g. 4.7 and 6.1?
>
> I don't have yet a gcc 6.1 armhf cross compiler.
It's fairly straightforward to build gcc binaries for testing kernels
using Segher's "buildall" script from
git://git.infradead.org/users/segher/buildall.git
I use that for testing out lots of gcc versions across architectures
when I run into a bug that is version specific.
> My gcc used to be
> http://releases.linaro.org/14.04/components/toolchain/binaries/gcc-linaro-arm-linux-gnueabihf-4.8-2014.04_linux.tar.xz
> but during the bug chase I switched to Debian's cross-gcc. It says it is
> "arm-linux-gnueabihf-gcc ( 4.9.2-10) 4.9.2".
>
> A kernel compiled by someone else with some "gcc-4.8
> 4.8.4-2ubuntu1~14.04" also had the same oopsing-behavior.
ok. Both of them are a few years old, so it would be interesting to
try a new version, probably not so interesting to try anything older.
If you like, I can also send you a tarball with gcc-6 binaries
> > - If it happens with an unpatched linux-4.6 and all gcc versions
> > with multi_v7_defconfig, we should try reproducing it on some
> > other quad-core Cortex-A9 (non-imx) platform with the same kernel
> > binary.
>
> I will try to reproduce it, as said it can take between 3 to 24 hours.
>
> I'll be using
>
> - vanilla 4.6 without any patch
> - ... and without any device-specific kernel module
> - multi_v7_defconfig
> - Debian 4.9.2 cross-compiler
awesome.
> [ 2.844408] Bluetooth: L2CAP socket layer initialized
> [ 2.849514] Bluetooth: SCO socket layer initialized
> [ 2.895099] usbcore: registered new interface driver btusb
> [ 3.257897] SMSC LAN8710/LAN8720 2188000.ethernet:00: attached PHY driver [SMSC LAN8710/LAN8720] (mii_bus:phy_addr=2188000.ethernet:00, irq=-1)
> [17827.758966] Unable to handle kernel paging request at virtual address 00001014
0x1014 is a rather long offset, but still a plausible NULL pointer. The r10 register
contains 0x1000, so this is probably an incorrect value and is being used as a pointer
with offset 0x14 added in. With an objdump of your net/core/dev.o, we could
see which pointer that was, to maybe figure out which lock is supposed to protect
it and where that lock gets released.
> [17827.766279] pgd = ee09c000
> [17827.769003] [00001014] *pgd=3eba3831, *pte=00000000, *ppte=00000000
> [17827.775383] Internal error: Oops: 17 [#1] SMP ARM
> [17827.780108] Modules linked in: usbhid btusb btrtl btbcm btintel bluetooth flexcan smsc95xx usbnet mii ptxc(O)
> [17827.790242] CPU: 1 PID: 372 Comm: stress-ng-socke Tainted: G O 4.5.4 #1
> [17827.797995] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> [17827.804536] task: ed614780 ti: eebba000 task.ti: eebba000
> [17827.809977] PC is at __netif_receive_skb_core+0x328/0xa9c
Unfortunately in the middle of a rather long function, and I don't
see a spin_unlock in this function, in fact it's not even called
with a spinlock held, so it must be something more indirect.
> [17828.297390] r4:ef7ba1b8 r3:ef7ba148
> [17828.301008] [<c03aaa30>] (process_backlog) from [<c03aa2cc>] (net_rx_action+0x1fc/0x2f0)
> [17828.309101] r10:eebbbaf0 r9:00000040 r8:c05ea100 r7:0000012c r6:001abec9 r5:c03aaa30
> [17828.317014] r4:ef7ba1b8 r3:eebbbaf0
> [17828.320637] [<c03aa0d0>] (net_rx_action) from [<c0022444>] (__do_softirq+0x134/0x254)
> [17828.328470] r10:c05ea080 r9:40000003 r8:00000101 r7:eebba000 r6:c05ea08c r5:00000003
> [17828.336381] r4:00000000
> [17828.338937] [<c0022310>] (__do_softirq) from [<c0022624>] (do_softirq+0x68/0x70)
> [17828.346334] r10:00000006 r9:ef06a800 r8:0000000e r7:00000000 r6:eeb1c200 r5:ee05ba30
> [17828.354238] r4:60030013
> [17828.356797] [<c00225bc>] (do_softirq) from [<c00226e8>] (__local_bh_enable_ip+0xbc/0xd0)
> [17828.364890] r4:00000200 r3:00000008
> [17828.368520] [<c002262c>] (__local_bh_enable_ip) from [<c03e0360>] (ip_finish_output2+0x1b4/0x3b8)
> [17828.377395] r5:ee05ba30 r4:00000000
> [17828.381005] [<c03e01ac>] (ip_finish_output2) from [<c03e1ed4>] (ip_finish_output+0x14c/0x20c)
> [17828.389531] r8:ee2d2c00 r7:ee2d2c00 r6:c0606280 r5:0000ffff r4:ee05ba30
> [17828.396321] [<c03e1d88>] (ip_finish_output) from [<c03e2a34>] (ip_output+0x124/0x130)
> [17828.404153] r10:00000006 r9:ef06a800 r8:ee2d2c00 r7:00000000 r6:0971f1b9 r5:c0606280
> [17828.412061] r4:ee05ba30
This might be the smoking gun: We are in the middle of do_softirq() as triggered
from a __local_bh_enable_ip. This is often called from __raw_spin_unlock_bh(),
though I don't see that call in ip_finish_output2.
http://lwn.net/Articles/687617/ has some explanation about how this mechanism
works, and what some recent plans to change that are. I don't immediately see
anything that could lead to your bug there, but it's a start.
Arnd
More information about the linux-arm-kernel
mailing list