[PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
Chris Metcalf
cmetcalf at tilera.com
Thu Nov 10 10:11:48 EST 2011
On 11/10/2011 9:22 AM, Michael Holzheu wrote:
> On Wed, 2011-11-09 at 16:04 -0800, Andrew Morton wrote:
>> On Thu, 03 Nov 2011 11:07:24 +0100
>> Michael Holzheu<holzheu at linux.vnet.ibm.com> wrote:
> [snip]
>
>> Ho hum, I guess we stick with the original patch. It *should* work, as
>> long as all architectures are doing the expected thing. But in this
>> situation it is bad of us to just hope that the architectures are doing
>> this. We should go and find out, rather than waiting for bug reports
>> to come in. Especially because in this case, bugs will take a very
>> long time indeed to even be noticed.
>>
>> One way to resolve this would be to ask the various arch maintainers!
> Hello arch maintainers (from scripts/get_maintainer.pl),
>
> Andrew asked me to contact you in this case.
>
> The main concern of the patch below is that smp_send_stop() might not be
> able to stop irq-disabled CPUs. So when two CPUs enter in parallel
> panic() and the 2nd one has irqs disabled, with my patch below, perhaps
> the 2nd CPU can't be stopped. On s390 and also on x86 (with a patch from
> Don Zickus) this is not a problem.

On tile, smp_send_stop() is delivered via IPIs that respect irq
disabling, i.e. we wouldn't handle the message on the 2nd cpu in your
scenario above.

This may not be a problem on many architectures, though. If one or more
cpus are blocked in spin_lock(), that may be just as effective from a
"machine halt" point of view as if those cpus had handled the smp_stop_cpu
interrupt. On tile the handler just leaves the cpu with interrupts
disabled anyway, though sitting on a lower-power "nap" instruction rather
than spinning trying to acquire the lock. (It may also be the case that on
some architectures you need to have shepherded all the cpus into the
"machine halt" state before you can reboot them, though that's not true on
tile.)

If a cleaner API seems useful (either for power reasons or restartability
or whatever), I suppose a standard global function name could be specified
for the routine that runs when a cpu receives an smp_send_stop IPI (in
tile's case it's "smp_stop_cpu_interrupt()"). The panic() code could
instead just do an atomic_inc_return() of a global panic counter and, if it
wasn't the first panicking cpu, call directly into the smp_stop handler
routine to quiesce itself. Then the panicking cpu could finish whatever it
needs to do and then halt, reboot, etc., all the cpus.
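
The panic-counter idea above can be sketched roughly as follows. This is
a userspace illustration only, using C11 atomics in place of the kernel's
atomic_inc_return(); it is not code from any posted patch. The names
panic_cpu_count, panic_pick_role() and panic_sketch() are hypothetical,
and the two callbacks stand in for the arch stop handler and for whatever
work the first panicking cpu still has to do.

```c
#include <stdatomic.h>

/* Hypothetical global counter of cpus that have entered panic(). */
static atomic_int panic_cpu_count;

enum panic_role {
	PANIC_LEADER,	/* first cpu in: finishes the panic work */
	PANIC_QUIESCE,	/* any later cpu: should stop itself */
};

/*
 * Userspace stand-in for "atomic_inc_return(&count) == 1".
 * atomic_fetch_add() returns the value before the increment,
 * so a result of 0 means this caller was the first one in.
 */
enum panic_role panic_pick_role(void)
{
	if (atomic_fetch_add(&panic_cpu_count, 1) == 0)
		return PANIC_LEADER;
	return PANIC_QUIESCE;
}

/*
 * Sketch of the panic path. stop_self stands in for the arch stop
 * handler (smp_stop_cpu_interrupt() on tile, per the text above);
 * finish stands in for the remaining panic work. Both are placeholders.
 */
void panic_sketch(void (*stop_self)(void), void (*finish)(void))
{
	if (panic_pick_role() == PANIC_QUIESCE) {
		stop_self();	/* a real handler would not return */
		return;
	}
	finish();
}
```

Only the first caller is treated as the panicking cpu; every later caller
goes straight to the stop routine instead of waiting for an IPI that it
might never handle with irqs disabled.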

For what it's worth, we do sometimes see this condition: a bunch of cpus
try to panic near-simultaneously and you get crazy interleaved panic
output, so I'd certainly support some patch of this nature.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com