[PATCH v2] arm64: smp: smp_send_stop() and crash_smp_send_stop() should try non-NMI first

Doug Anderson dianders at chromium.org
Mon Aug 5 10:13:11 PDT 2024


Hi,

On Mon, Aug 5, 2024 at 9:53 AM Will Deacon <will at kernel.org> wrote:
>
> Hi Doug,
>
> On Tue, Jun 25, 2024 at 04:07:22PM -0700, Douglas Anderson wrote:
> > @@ -1084,79 +1088,87 @@ static inline unsigned int num_other_online_cpus(void)
> >
> >  void smp_send_stop(void)
> >  {
> > +     static unsigned long stop_in_progress;
> > +     cpumask_t mask;
> >       unsigned long timeout;
> >
> > -     if (num_other_online_cpus()) {
> > -             cpumask_t mask;
> > +     /*
> > +      * If this cpu is the only one alive at this point in time, online or
> > +      * not, there are no stop messages to be sent around, so just back out.
> > +      */
> > +     if (num_other_online_cpus() == 0)
> > +             goto skip_ipi;
> >
> > -             cpumask_copy(&mask, cpu_online_mask);
> > -             cpumask_clear_cpu(smp_processor_id(), &mask);
> > +     /* Only proceed if this is the first CPU to reach this code */
> > +     if (test_and_set_bit(0, &stop_in_progress))
> > +             return;
> >
> > -             if (system_state <= SYSTEM_RUNNING)
> > -                     pr_crit("SMP: stopping secondary CPUs\n");
> > -             smp_cross_call(&mask, IPI_CPU_STOP);
> > -     }
> > +     cpumask_copy(&mask, cpu_online_mask);
> > +     cpumask_clear_cpu(smp_processor_id(), &mask);
> >
> > -     /* Wait up to one second for other CPUs to stop */
> > +     if (system_state <= SYSTEM_RUNNING)
> > +             pr_crit("SMP: stopping secondary CPUs\n");
> > +
> > +     /*
> > +      * Start with a normal IPI and wait up to one second for other CPUs to
> > +      * stop. We do this first because it gives other processors a chance
> > +      * to exit critical sections / drop locks and makes the rest of the
> > +      * stop process (especially console flush) more robust.
> > +      */
> > +     smp_cross_call(&mask, IPI_CPU_STOP);
>
> I realise you've moved this out of crash_smp_send_stop() and it looks
> like we inherited the code from x86, but do you know how this serialise
> against CPU hotplug operations? I've spent the last 20m looking at the
> code and I can't see what prevents other CPUs from coming and going
> while we're trying to IPI a non-atomic copy of 'cpu_online_mask'.

I don't think there is anything. ...and it's not just this code
either. It sure looks like nmi_trigger_cpumask_backtrace() has the
same problem.

I guess maybe in the case of nmi_trigger_cpumask_backtrace() it's not
such a big deal because:
1. If a CPU goes away then we'll just time out
2. If a CPU shows up then we'll skip backtracing it, but we were
sending backtraces at an instant in time anyway.

In the case of smp_send_stop() it's probably fine if a CPU goes away
because, again, we'll just timeout. ...but if a CPU shows up then
that's not super ideal. Maybe it doesn't cause problems in practice
but it does feel like it should be fixed.

-Doug



More information about the linux-arm-kernel mailing list