[patch] ARM: smpboot: Enable interrupts after marking CPU online/active
Santosh
santosh.shilimkar at ti.com
Fri Sep 9 00:17:07 EDT 2011
On Friday 09 September 2011 03:27 AM, Thomas Gleixner wrote:
> Frank Rowand reported:
>
> I have a consistent (every boot) hang on boot with the RT patches.
> With a few hacks to get console output, I get:
>
> rcu_preempt_state detected stalls on CPUs/tasks
>
> I have also replicated the problem on the ARM RealView (in tree) and
> without the RT patches.
>
> The problem ended up being caused by the allowed cpus mask being set
> to all possible cpus for the ksoftirqd on the secondary processors.
> So the RCU softirq was never executing on the secondary cpu.
>
> The problem was that ksoftirqd was woken on the secondary processors before
> the secondary processors were online. This led to allowed cpus being set
> to all cpus.
>
>    wake_up_process()
>       try_to_wake_up()
>          select_task_rq()
>             if (... || !cpu_online(cpu))
>                select_fallback_rq(task_cpu(p), p)
>                   ...
>                   /* No more Mr. Nice Guy. */
>                   dest_cpu = cpuset_cpus_allowed_fallback(p)
>                      do_set_cpus_allowed(p, cpu_possible_mask)
>                         # Thus ksoftirqd can now run on any cpu...
> </report>
>
> The reason is that the ARM SMP boot code for the secondary CPUs enables
> interrupts before the newly brought up CPU is marked online and
> active.
>
> That causes a wakeup of ksoftirqd, or of any other kernel thread
> which is affine to the brought up CPU, to break that thread's
> affinity, so the thread ends up being scheduled on already online CPUs.
>
> This problem has been observed on x86 before and the only solution is
> to mark the CPU online and wait for the CPU active bit before the
> point where interrupts are enabled.
>
> This is safe as the percpu timer setup and the calibration code are
> not part of the critical setup path and the calibration code needs to
> have interrupts enabled anyway. We cannot schedule away at this point
> because we are still in the preempt disabled region which is released
> in cpu_idle().
>
> Reported-and-tested-by: Frank Rowand <frank.rowand at am.sony.com>
> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1109071115410.2723@ionos
> Signed-off-by: Thomas Gleixner <tglx at linutronix.de>
A while back, while debugging a CPU ONLINE issue, I cooked up a
similar patch based on the race condition described above:
https://lkml.org/lkml/2011/6/20/79
The issue I was facing was slightly different, though, and it got
sorted out by fixing the re-calibration code.
Good to see that we now have a test case which proves the race
condition I was describing.
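For anyone following along, here is a rough user-space sketch of the
race. It is a toy model, not kernel code: the struct, the masks and
every model_* function are hypothetical stand-ins, and the fallback
logic is heavily simplified compared to the real select_fallback_rq().
It only assumes that an interrupt wakes the per-CPU ksoftirqd as soon
as interrupts are enabled on the secondary CPU.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2

/* Toy stand-in for a task; not the kernel's struct task_struct. */
struct task {
	unsigned int cpus_allowed;	/* bitmask of CPUs the task may run on */
	int cpu;			/* CPU the task is currently assigned to */
};

static unsigned int cpu_online_mask;	/* bit n set: CPU n is online */
static const unsigned int cpu_possible_mask = (1u << NR_CPUS) - 1;

static bool cpu_is_online(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpu_online_mask & (1u << cpu)) != 0;
}

/* Simplified fallback: try allowed online CPUs, else widen the affinity. */
static int model_select_fallback_rq(struct task *p)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if ((p->cpus_allowed & (1u << cpu)) && cpu_is_online(cpu))
			return cpu;

	/* "No more Mr. Nice Guy": the task may now run on any possible CPU. */
	p->cpus_allowed = cpu_possible_mask;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_is_online(cpu))
			return cpu;
	return -1;
}

/* Simplified wakeup: only falls back if the task's CPU is not online. */
static int model_wake_up(struct task *p)
{
	if (!cpu_is_online(p->cpu))
		p->cpu = model_select_fallback_rq(p);
	return p->cpu;
}

/*
 * Bring up secondary CPU 1 and return the affinity mask that its
 * ksoftirqd ends up with. The wakeup stands for the first interrupt
 * taken after interrupts are enabled on the new CPU.
 */
static unsigned int model_bringup(bool online_before_irqs)
{
	struct task ksoftirqd1 = { .cpus_allowed = 1u << 1, .cpu = 1 };

	cpu_online_mask = 1u << 0;		/* only the boot CPU so far */

	if (online_before_irqs)
		cpu_online_mask |= 1u << 1;	/* ordering the patch proposes */

	model_wake_up(&ksoftirqd1);		/* interrupts enabled: wakeup */

	cpu_online_mask |= 1u << 1;		/* old ordering: online too late */
	return ksoftirqd1.cpus_allowed;
}
```

In this model, model_bringup(false) leaves ksoftirqd with the mask 0x3
(affinity broken, free to run on CPU 0), while model_bringup(true)
keeps it at 0x2, i.e. still bound to CPU 1.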
Regards
Santosh