[PATCH v2 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort

Doug Anderson dianders at chromium.org
Wed Jun 10 15:50:32 PDT 2026


Hi,

On Tue, Jun 9, 2026 at 6:58 AM Kiryl Shutsemau <kirill at shutemov.name> wrote:
>
> @@ -910,6 +911,35 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
>  #endif
>  }
>
> +#ifdef CONFIG_ARM_SDEI_NMI
> +/*
> + * Stop entry for the SDEI cross-CPU NMI service: its event-0 handler
> + * lands here when this CPU was asked to stop. The bookkeeping mirrors
> + * the IPI_CPU_STOP{,_NMI} handling; the park happens inside the SDEI
> + * event, which is never completed -- completing it would have firmware
> + * resume the interrupted (typically wedged) context. No PSCI CPU_OFF
> + * either: powering off a PE that EL3 still considers mid-event invites
> + * firmware trouble.
> + */
> +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs)
> +{
> +       unsigned int cpu = smp_processor_id();
> +
> +       local_daif_mask();
> +
> +       if (IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop)
> +               crash_save_cpu(regs, cpu);
> +
> +       /* the ack the stop requester polls for */
> +       set_cpu_online(cpu, false);
> +
> +       sdei_mask_local_cpu();
> +
> +       cpu_park_loop();
> +}
> +NOKPROBE_SYMBOL(arm64_nmi_cpu_stop);
> +#endif

Can we combine everything into one function so we don't have to keep
all the different stop functionality in sync? Like this (untested):

void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash)
{
  unsigned int cpu = smp_processor_id();
  bool crash = IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop;

  /*
   * Use local_daif_mask() instead of local_irq_disable() to make sure
   * that pseudo-NMIs are disabled. The "stop" code starts with
   * an IRQ and falls back to NMI (which might be pseudo). If the IRQ
   * finally goes through right as we're timing out then the NMI could
   * interrupt us. It's better to prevent the NMI and let the IRQ
   * finish since the pt_regs will be better.
   */
  local_daif_mask();

  if (crash)
    crash_save_cpu(regs, cpu);

  set_cpu_online(cpu, false);

  sdei_mask_local_cpu();

  if (crash && die_on_crash)
    __cpu_try_die(cpu);

  /* just in case */
  cpu_park_loop();
}

Then in do_handle_IPI(), it's just:

  case IPI_CPU_STOP:
  case IPI_CPU_STOP_NMI:
    arm64_nmi_cpu_stop(get_irq_regs(), true);
    break;

...and from the SDEI code you pass "false" for "die_on_crash".

Perhaps when doing that you'd stop unconditionally getting the cpu in
do_handle_IPI() and just call it for `IPI_KGDB_ROUNDUP` since it would
now be the only consumer of that local variable.

FWIW, I'm not totally sure I followed the logic for why "die_on_crash"
needs to be "false" for the SDEI case, but I also am not terribly
familiar with KEXEC and crash-dumping kernels, since I've never
attempted to get that feature to work.


> @@ -1263,6 +1293,29 @@ void smp_send_stop(void)
>                         udelay(1);
>         }
>
> +       /*
> +        * If CPUs are *still* online, try the SDEI cross-CPU NMI. Firmware
> +        * delivers it regardless of the target's DAIF state, so it reaches
> +        * a CPU spinning with interrupts masked, which neither rung above
> +        * could (without pseudo-NMI there is no NMI rung at all). Allow
> +        * 100ms: a firmware round-trip per CPU, with headroom.
> +        */
> +       if (num_other_online_cpus()) {
> +               /* re-snapshot after the rungs above took CPUs offline */
> +               smp_rmb();
> +               cpumask_copy(&mask, cpu_online_mask);
> +               cpumask_clear_cpu(smp_processor_id(), &mask);
> +
> +               if (sdei_nmi_stop_cpus(&mask)) {
> +                       pr_info("SMP: retry stop with SDEI NMI for CPUs %*pbl\n",
> +                               cpumask_pr_args(&mask));

Perhaps it's being a bit pedantic, but it's a little weird that you're
printing a message that sounds like "I'm going to retry with SDEI"
after you've already done it. It feels like it would be nominally
cleaner (and more parellel with the pseudo-NMI case) if you could
separately test if SDEI is available. Then sdei_nmi_stop_cpus() would
just return void?


> @@ -59,8 +64,45 @@ static bool sdei_nmi_available;
>
>  #define SDEI_NMI_EVENT                 0
>
> +/*
> + * Stop-request dispatch lives on the same SDEI event 0 as everything
> + * else. The requesting CPU sets each target's bit in sdei_nmi_stop_mask
> + * before signalling event 0; the target's handler test-and-clears its
> + * bit and hands the CPU to arm64_nmi_cpu_stop(), which saves crash
> + * state when the stop is a kdump crash-stop, marks the CPU offline
> + * (which is what the requester polls for) and parks it.
> + *
> + * This mirrors the cpumask the framework's nmi_cpu_backtrace() consults
> + * just below, and a shared mask rather than a separate SDEI event avoids
> + * extra registrations from firmware.
> + */

Do you have any reasoning for why you don't pick a separate EVENT ID
for "backtrace" vs. "stop". If you absolutely have to share an ID
because they're a limited resource then I guess it's fine, but it
would make the code easier to understand / reason about if they were
separate IDs.

If you had a separate EVENT ID, then it seems like you could
completely eliminate the (potentially large) `sdei_nmi_stop_mask`
variable, right? Any time a "STOP" event fires you can unconditionally
consider it to be a stop w/ no globals needed, right?


> @@ -115,6 +157,35 @@ bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
>         return true;
>  }
>
> +/*
> + * Last rung of the stop escalation in smp_send_stop() (see
> + * arch/arm64/kernel/smp.c). The caller runs the regular stop IPI (and
> + * the pseudo-NMI stop IPI, where available) first; @mask holds whatever
> + * stayed online through those -- typically CPUs wedged with interrupts
> + * masked, unreachable by an IPI. Set each target's stop-request flag and
> + * signal event 0 at it; a target acks by marking itself offline, which
> + * the caller polls for.
> + *
> + * Returns false when SDEI isn't active, so the caller can skip the wait.
> + */
> +bool sdei_nmi_stop_cpus(const cpumask_t *mask)
> +{
> +       unsigned int cpu;
> +
> +       if (!sdei_nmi_available)
> +               return false;
> +
> +       cpumask_or(&sdei_nmi_stop_mask, &sdei_nmi_stop_mask, mask);

As per above, hopefully we can get rid of `sdei_nmi_stop_mask`. ...but
if we keep it, I'm curious why "or" and not "copy"?



More information about the kexec mailing list