[PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic

Chris Metcalf cmetcalf at tilera.com
Thu Nov 10 10:11:48 EST 2011

On 11/10/2011 9:22 AM, Michael Holzheu wrote:
> On Wed, 2011-11-09 at 16:04 -0800, Andrew Morton wrote:
>> On Thu, 03 Nov 2011 11:07:24 +0100
>> Michael Holzheu<holzheu at linux.vnet.ibm.com>  wrote:
> [snip]
>> Ho hum, I guess we stick with the original patch.  It *should* work, as
>> long as all archtectures are doing the expected thing.  But in this
>> situation it is bad of us to just hope that the architectures are doing
>> this.  We should go and find out, rather than waiting for bug reports
>> to come in.  Especially because in this case, bugs will take a very
>> long time indeed to even be noticed.
>> One way to resolve this would be to ask the various arch maintainers!
> Hello arch maintainers (from scripts/get_maintainer.pl),
> Andrew asked me to contact you in this case.
> The main concern of the patch below is that smp_send_stop() might not be
> able to stop irq-disabled CPUs. So when two CPUs enter in parallel
> panic() and the 2nd one has irqs disabled, with my patch below, perhaps
> the 2nd CPU can't be stopped. On s390 and also on x86 (with a patch from
> Don Zickus) this is not a problem.

On tile the smp_send_stop() is delivered via IPIs that respect irq 
disabling, i.e. we wouldn't handle the message on the 2nd cpu in your 
scenario above.

This may not be a problem on many architectures, though.  If one or more 
cpus is blocked in spin_lock(), that may be just as effective from a 
"machine halt" point of view as if those cpus had handled the smp_stop_cpu 
interrupt, which on tile just leaves the cpu with interrupts disabled 
anyway, though sitting on a lower-power "nap" instruction rather than 
spinning trying to acquire the lock.  (It may also be the case that on some 
architectures you need to have shepherded all the cpus into the "machine 
halt" state before you can reboot them, though that's not true on tile.)

If a cleaner API seems useful (either for power reasons or restartability 
or whatever), I suppose a standard global function name could be specified 
that's the thing you execute when you get an smp_send_stop IPI (in tile's 
case it's "smp_stop_cpu_interrupt()") and the panic() code could instead 
just do an atomic_inc_return() of a global panic counter, and if it wasn't 
the first panicking cpu, call directly into the smp_stop handler routine to 
quiesce itself.  Then the panicking cpu could finish whatever it needs to 
do and then halt, reboot, etc., all the cpus.

For what it's worth we do see the condition sometimes when a bunch of cpus 
try to panic near-simultaneously and you get crazy interleaved panic 
output, so I'd certainly support some patch of this nature.
Chris Metcalf, Tilera Corp.

More information about the kexec mailing list