Possible kernel bug in torvalds/linux/master

Tony Lindgren tony at atomide.com
Sun Mar 25 08:19:04 PDT 2018


Hi,

* Arnd Bergmann <arnd at arndb.de> [180325 13:30]:
> On Sun, Mar 25, 2018 at 3:03 PM, Christophe Lyon
> <christophe.lyon at linaro.org> wrote:
> > Hi Arnd,
> >
> > We have a Jenkins job that builds the kernel from torvalds/linux
> > master branch multi_v7 defconfig every day, using our latest GCC release
> > (7.2-2017-11), and boots a beaglebone-black board.
> >
> > Last week it started to fail. I first suspected a Lava problem, but
> > the job now fails every time, and Remi Duraffort from the Lava team
> > thinks it's really a kernel problem.
> >
> > Is this something you are interested in investigating? Or should we
> > switch to another "less-edge" branch?
> >
> > The last successful run:
> > https://ci.linaro.org/job/tcwg-buildapp/app=linux+multi_v7,label=tcwg-x86_64-build,target=arm-linux-gnueabihf/75/
> > The next one failed:
> > https://ci.linaro.org/job/tcwg-buildapp/app=linux+multi_v7,label=tcwg-x86_64-build,target=arm-linux-gnueabihf/76
> >
> > Build 75 was with this kernel commit:
> > Merge branch 'for-4.16-fixes'
> > 1b5f3ba415fe4cf8b8b39c8d104ed44cde330658
> >
> > Build 76 was with:
> > Merge tag 'clk-fixes-for-linus'
> > 3215b9d57a2c75c4305a3956ca303d7004485200
> 
> Hi Christophe,
> 
> This branch is certainly the right one to test, thanks for the report!
> From looking at the output above, it seems that the kernel no longer
> boots at all, and fails to even print any messages. Between the
> two runs, I see the following commits:
> 
> 3215b9d57a2c Merge tag 'clk-fixes-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
> 303851e14a8f Merge tag 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
> 76c0b6a36a12 Merge tag 'scsi-fixes' of
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> 645102eac15e Merge tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux
> 32d43cd391ba kvm/x86: fix icebp instruction handling
> e8980d67d601 RDMA/ucma: Ensure that CM_ID exists prior to access it
> 68ef3bc31664 nfsd: remove blocked locks on client teardown
> 80cf79ae4f68 RDMA/verbs: Remove restrack entry from XRCD structure
> ed65a4dc2208 RDMA/ucma: Fix use-after-free access in ucma_close
> 7997f3b2df75 clk: bcm2835: Protect sections updating shared registers
> 49012d1bf5f7 clk: bcm2835: Fix ana->maskX definitions
> 2975d5de6428 RDMA/ucma: Check AF family prior resolving address
> 8a53fc511c5e clk: aspeed: Prevent reset if clock is enabled
> d90c76bb6112 clk: aspeed: Fix is_enabled for certain clocks
> bd8602ca42f6 infiniband: bnxt_re: use BIT_ULL() for 64-bit bit masks
> 5388a508479d infiniband: qplib_fp: fix pointer cast
> 42cea83f9524 IB/mlx5: Fix cleanup order on unload
> 0c81ffc60d52 RDMA/ucma: Don't allow join attempts for unsupported AF family
> 7688f2c3bbf5 RDMA/ucma: Fix access to non-initialized CM_ID object
> 9dea9a2ff61c RDMA/core: Do not use invalid destination in determining port reuse
> f3f134f5260a RDMA/mlx5: Fix crash while accessing garbage pointer and
> freed memory
> c2b37f76485f IB/mlx5: Fix integer overflows in mlx5_ib_create_srq
> 2c292dbb398e IB/mlx5: Fix out-of-bounds read in create_raw_packet_qp_rq
> 14bc1dff7427 scsi: qla2xxx: Remove FC_NO_LOOP_ID for FCP and FC-NVMe Discovery
> 318aaf34f117 scsi: libsas: defer ata device eh commands to libata
> 55c19eee3b47 clk: qcom: msm8916: Fix return value check in
> qcom_apcs_msm8916_clk_probe()
> 9903e41ae1f5 clk: hisilicon: hi3660:Fix potential NULL dereference in
> hi3660_stub_clk_probe()
> 56e1ee353943 Merge branch 'clk-helpers' (early part) into clk-fixes
> 04bf9ab3359f clk: fix determine rate error with pass-through clock
> 91584eb51b47 Merge branch 'clk-phase' into clk-fixes
> bd13c6cbd3c0 Merge tag 'ti-clk-fixes-4.16' of
> https://github.com/t-kristo/linux-pm into clk-fixes
> a88bb86d58ce Merge tag 'clk-imx-fixes-4.16' of
> git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into
> clk-fixes
> 957a42e8599a Merge tag 'sunxi-clk-fixes-for-4.16' of
> https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into
> clk-fixes
> 99652a469df1 clk: migrate the count of orphaned clocks at init
> 7f95beea3608 clk: update cached phase to respect the fact when setting phase
> 762790b75210 clk: ti: am43xx: add set-rate-parent support for display
> clkctrl clock
> c083dc5f3738 clk: ti: am33xx: add set-rate-parent support for display
> clkctrl clock
> 49159a9dc3da clk: ti: clkctrl: add support for CLK_SET_RATE_PARENT flag
> a275b315334d clk: imx51-imx53: Fix UART4/5 registration on i.MX50 and i.MX53
> 5682e268350f clk: sunxi-ng: a31: Fix CLK_OUT_* clock ops
> 
> Out of these, all the interesting ones are clk related:
> 
> 56e1ee353943 Merge branch 'clk-helpers' (early part) into clk-fixes
> 04bf9ab3359f clk: fix determine rate error with pass-through clock
> 91584eb51b47 Merge branch 'clk-phase' into clk-fixes
> bd13c6cbd3c0 Merge tag 'ti-clk-fixes-4.16' of
> https://github.com/t-kristo/linux-pm into clk-fixes
> 99652a469df1 clk: migrate the count of orphaned clocks at init
> 7f95beea3608 clk: update cached phase to respect the fact when setting phase
> 762790b75210 clk: ti: am43xx: add set-rate-parent support for display
> clkctrl clock
> c083dc5f3738 clk: ti: am33xx: add set-rate-parent support for display
> clkctrl clock
> 49159a9dc3da clk: ti: clkctrl: add support for CLK_SET_RATE_PARENT flag
> 
> I've added the involved parties to Cc. We also see the same thing on
> kernelci, where many OMAP based systems now fail to boot, with the
> problem starting at the same commit:
> 
> https://kernelci.org/boot/all/job/mainline/branch/master/kernel/v4.16-rc6-431-gbcfc1f455466/
> 
> It's possible that this has already been debugged and a fix is being worked on,
> but I'm not aware of anything, since I have not followed my email
> while travelling.

I've confirmed that omap2plus_defconfig boots on bbb while
multi_v7_defconfig fails to boot with the following:

l4_wkup_cm:clk:0010:0: failed to disable
Unhandled fault: external abort on non-linefetch (0x1028) at 0xfa30e054
pgd = 4b21228f
[fa30e054] *pgd=48211452(bad)
Internal error: : 1028 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc6-00075-g3215b9d57a2c #709
Hardware name: Generic AM33XX (Flattened Device Tree)
PC is at _update_sysc_cache+0x2c/0x88
LR is at _enable+0x19c/0x274
pc : [<c032a844>]    lr : [<c032afc8>]    psr: 40000013
sp : db0adea0  ip : 00000003  fp : 00000000
r10: c144997c  r9 : 00000157  r8 : 00000003
r7 : c151d30c  r6 : 00000000  r5 : c1678ef4  r4 : c151b2f0
r3 : fa30e054  r2 : c151b360  r1 : 00000054  r0 : c151b2f0
Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 80204019  DAC: 00000051
Process swapper/0 (pid: 1, stack limit = 0x2ddf0754)
Stack: (0xdb0adea0 to 0xdb0ae000)
dea0: c151b2f0 c032afc8 00000000 a0000013 c1504c48 c151b2f0 c151b314 c1504c48
dec0: c151b328 c1311c78 a0000013 c0c15ec4 00000011 edaa6d91 c131297c c151b2f0
dee0: c150ce28 c131297c ffffe000 c1312a68 c1504c48 00000000 c131297c c0302730
df00: dfdffb06 dfdffafa c1250ecc 00000100 00000157 c0361f34 c124f400 c10cc358
df20: 00000000 00000002 00000002 c10dec28 00000000 c1504c48 c10eeca0 c10dec9c
df40: 00000000 dfdffb06 00000000 edaa6d91 00000000 c1677700 c1677700 c13cf824
df60: c13cf83c 00000003 00000157 c144997c 00000000 c1300e2c 00000002 00000002
df80: 00000000 c13005c0 00000000 c0d96788 00000000 00000000 00000000 00000000
dfa0: 00000000 c0d96790 00000000 c03010e8 00000000 00000000 00000000 00000000
dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 d5370d56 dcffd777
[<c032a844>] (_update_sysc_cache) from [<c032afc8>] (_enable+0x19c/0x274)
[<c032afc8>] (_enable) from [<c1311c78>] (_setup.part.16+0xd8/0x418)
[<c1311c78>] (_setup.part.16) from [<c1312a68>] (__omap_hwmod_setup_all+0xec/0x100)
[<c1312a68>] (__omap_hwmod_setup_all) from [<c0302730>] (do_one_initcall+0x54/0x18c)
[<c0302730>] (do_one_initcall) from [<c1300e2c>] (kernel_init_freeable+0x144/0x1d0)
[<c1300e2c>] (kernel_init_freeable) from [<c0d96790>] (kernel_init+0x8/0x110)
[<c0d96790>] (kernel_init) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
Exception stack(0xdb0adfb0 to 0xdb0adff8)
dfa0:                                     00000000 00000000 00000000 00000000
dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
Code: e31c0c01 e5903048 e0833001 1a00000a (e5933000)
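
For anyone reading the oops above, here is a minimal sketch of how the
faulting access can be split into a base and an offset. It assumes that
r1 holds the register offset being read and that r3 holds the computed
address (the Code: line appears to decode to an add of r1 into r3
followed by a load through r3, but treat that as my reading, not a
confirmed analysis). The helper names are made up for illustration.

```python
import re

# Two lines copied verbatim from the oops above.
OOPS = """\
Unhandled fault: external abort on non-linefetch (0x1028) at 0xfa30e054
r3 : fa30e054  r2 : c151b360  r1 : 00000054  r0 : c151b2f0
"""


def parse_registers(text):
    """Return {register name: value} for every 'rN : hex' pair in text."""
    pairs = re.findall(r"\b(r\d+)\s*:\s*([0-9a-f]{8})", text)
    return {name: int(value, 16) for name, value in pairs}


def parse_fault_address(text):
    """Return the address reported on the 'Unhandled fault' line."""
    match = re.search(r"Unhandled fault: .* at 0x([0-9a-f]+)", text)
    if match is None:
        raise ValueError("no 'Unhandled fault' line found")
    return int(match.group(1), 16)


regs = parse_registers(OOPS)
fault = parse_fault_address(OOPS)

# Assumption: r1 is the offset that was added to a mapped base address.
base = fault - regs["r1"]

print("fault address:", hex(fault))
print("offset (r1):  ", hex(regs["r1"]))
print("assumed base: ", hex(base))  # 0xfa30e000
```

If that reading is right, the abort happens on a read at offset 0x54
from a base of 0xfa30e000, which matches r3 in the register dump.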

Tero, could this be some timing-related clock issue?

Regards,

Tony


