Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing

Sudeep Holla sudeep.holla at arm.com
Thu Apr 2 10:38:51 PDT 2015



On 02/04/15 15:13, Russell King - ARM Linux wrote:
> On Tue, Mar 31, 2015 at 06:27:30PM +0100, Sudeep Holla wrote:
>> Not sure on that as v3.18 with DT seems to be working fine and passed
>> overnight reboot testing.
>
> Okay, that suggests there's something post v3.18 which is causing this,
> rather than it being a DT vs non-DT thing.
>

Correct. Just to be 100% sure, I reverted the non-DT removal commit on
both v3.19-rc1 and v4.0-rc6 and was able to reproduce the issue without DT.

> An extra data point which I've just found (by enabling attempts to do
> hibernation on various test platforms) is that the Versatile Express
> appears to be incapable of taking a CPU offline.
>
> This crashes the entire system with sometimes random results.  Sometimes
> it'll appear that a spinlock has been left owned by CPU#1 which is
> offline.  Sometimes it'll silently hang.  Sometimes it'll start slowly
> dumping kernel messages from the start of the kernel's ring buffer (!),
> eg:
>
> PM: freeze of devices complete after 29.342 msecs
> PM: late freeze of devices complete after 6.398 msecs
> PM: noirq freeze of devices complete after 5.493 msecs
> Disabling non-boot CPUs ...
> __cpu_disable(1)
> __cpu_die(1)
> handle_IPI(0)
> Booting Linux on physical CPU 0x0
>
> So far, it's not managed to take a CPU successfully offline and know that
> it has.  If I disable the calls to cpu_enter_lowpower() and
> cpu_leave_lowpower(), then it appears to work.
>
> This leads me to wonder whether flush_cache_louis() works... which led me
> in turn to ARM_ERRATA_643719, which is disabled in my builds.  However,
> the CA9 tile has a r0p1 CA9, which allegedly suffers from this errata.
>

Yes, I noticed that and tested with the erratum enabled. It makes no
difference; I still hit the issue.
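
For anyone checking which revision their tile reports, here is a minimal
sketch of how an "rNpM" string such as r0p1 maps onto the Main ID Register
(MIDR) fields. The helper name is_cortex_a9_rev() is hypothetical and is not
taken from the kernel. The field layout, the ARM implementer code 0x41 and
the Cortex-A9 part number 0xc09 are the architecturally defined values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field layout of the ARMv7 Main ID Register (MIDR). */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_VARIANT(midr)     (((midr) >> 20) & 0xfu)  /* the "rN" part */
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfffu)
#define MIDR_REVISION(midr)    ((midr) & 0xfu)          /* the "pM" part */

#define ARM_IMPLEMENTER   0x41u
#define CORTEX_A9_PARTNUM 0xc09u

/*
 * Hypothetical helper (illustration only): returns true when midr
 * describes a Cortex-A9 at exactly r<variant>p<revision>.
 */
static bool is_cortex_a9_rev(uint32_t midr, unsigned int variant,
                             unsigned int revision)
{
    return MIDR_IMPLEMENTER(midr) == ARM_IMPLEMENTER &&
           MIDR_PARTNUM(midr) == CORTEX_A9_PARTNUM &&
           MIDR_VARIANT(midr) == variant &&
           MIDR_REVISION(midr) == revision;
}
```

With this, is_cortex_a9_rev(0x410fc091, 0, 1) is true. That value is an
example built from the field definitions above, not one read from the board.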

[...]
>
> I haven't tested going back to a tag latency of 1 1 1 yet.  Can you
> confirm whether you have this errata enabled for your tests?
>
I have now gone back to <1 1 1> latency to debug the issue as it's
easier to reproduce with those latencies.

After I failed terribly to bisect between v3.18..v3.19-rc1, as it depends
a lot on the config you choose (a lot of changes are introduced as it's the
merge window), I started looking at the code where we hit this issue, since
it's always in __radix_tree_lookup in lib/radix-tree.c while
accessing the slots, to see if it provides any more details.

Regards,
Sudeep


