ARM board lockups/hangs triggered by locks and mutexes

Florian Fainelli florian.fainelli at broadcom.com
Tue Aug 1 15:25:09 PDT 2023


Hi Rafal,

On 8/1/23 15:10, Rafał Miłecki wrote:
> Hi,
> 
> Years ago I added support for Broadcom's BCM53573 SoCs. We released
> firmwares based on Linux 4.4 (and later on 4.14) that worked almost
> fine. There was one little issue we couldn't debug or fix: random hangs
> and reboots. They were too rare to deal with (most devices worked fine
> for weeks or months).
> 
> Recently I updated my stable kernel 5.4 and I started experiencing
> stability issues on my own! After some uptime (usually from 0 to 20
> minutes of close to zero activity) the serial console hangs. I can't type
> anything and I stop getting any messages. I have to wait about a minute
> for the watchdog to kick in and reboot the device.
> 
> #####
> 
> I took that great chance and decided to track the regression.
> 
> The Linux 5.4 stable branch was stable up to release v5.4.197.
> Starting with v5.4.198 I began experiencing those stability issues. I
> bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
> with loops over idev->addr_list"):
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e
> 
> With above commit reverted I was able to use stable 5.4 branch up to the
> release v5.4.207. Starting with v5.4.208 it got unstable again. I
> bisected it down to:
> commit d0d583484d2e ("locking/refcount: Consolidate implementations of
> refcount_t")
> commit dab787c73f6e ("locking/refcount: Consolidate
> REFCOUNT_{MAX,SATURATED} definitions")
> commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
> commit 809554147d60 ("locking/refcount: Improve performance of generic
> REFCOUNT_FULL code")
> commit 9c9269977f03 ("locking/refcount: Move the bulk of the
> REFCOUNT_FULL implementation into the <linux/refcount.h> header")
> commit 04bff7d7b808 ("locking/refcount: Remove unused
> refcount_*_checked() variants")
> commit 513b19a43bec ("locking/refcount: Ensure integer operands are
> treated as signed")
> commit 68b4ee68e8c8 ("locking/refcount: Define constants for
> saturation and max refcount values")
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
> (I didn't actually check above commits individually).
> 
> Reverting the above locking/refcount commits worked fine for a few releases:
> up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
> that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
> sleepable contexts").
> 
> Reverting that extra commit from v5.4.238 allows me to run Linux for
> hours again (currently 3 devices x 6 hours and counting). So I need in
> total 10+1 reverts from 5.4 branch to get a stable kernel.
> 
> #####
> 
> I'm clueless at this point. Is it possible that the kernel has some
> locking bug I can hit only on this specific SoC? BCM53573s have a single
> ARM Cortex-A7 CPU running at 900 MHz. The only unusual thing about this
> hardware I can think of is a slow arch timer running at 36.8 kHz.

From the look of it, it seems like the CPU might have bugs with atomics?

Your log indicates that your Cortex-A7 is r0p5, which is described as
susceptible to ARM_ERRATA_814220. Do you have it enabled by any chance?
If not, can you enable it and see if it makes any difference?
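
In case it helps to check, here is a minimal sketch (not taken from any
tree) that reports how a symbol is set in a kernel .config. The helper
name kconfig_state and the sample text are made up for illustration; the
only thing it relies on is the usual "CONFIG_FOO=y" and
"# CONFIG_FOO is not set" line formats.

```python
import re


def kconfig_state(config_text, symbol):
    """Return the value of CONFIG_<symbol> ('y', 'm', or a literal),
    'n' if it is explicitly marked as not set, or None if the symbol
    does not appear in the text at all."""
    set_re = re.compile(r"^CONFIG_%s=(.*)$" % re.escape(symbol))
    unset_line = "# CONFIG_%s is not set" % symbol
    state = None
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            state = match.group(1)
        elif line == unset_line:
            state = "n"
    return state


# Hypothetical .config excerpt, for illustration only.
sample = """\
CONFIG_CPU_V7=y
# CONFIG_ARM_ERRATA_814220 is not set
CONFIG_DEBUG_SPINLOCK=y
"""

print(kconfig_state(sample, "ARM_ERRATA_814220"))
```

If the symbol does not show up at all (None here), I would assume the
option either does not exist in that tree or its dependencies are not
met, so arch/arm/Kconfig is worth a look as well.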

> 
> I tried compiling kernel with:
> CONFIG_SOFTLOCKUP_DETECTOR=y
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_WQ_WATCHDOG=y
> but it didn't change or report anything.
> 
> Unfortunately enabling *any* of the following options:
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> seems to make the lockups/hangs go away. I tried for a few hours.
> 
> Sadly I don't have access to JTAG or any low level debugging interface.
> 
> Does looking at the commits I reported above give anyone a hint about
> what may be going on?
> 

-- 
Florian
