[PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort

Puranjay Mohan puranjay12 at gmail.com
Mon Jun 15 03:25:17 PDT 2026


On Mon, Jun 15, 2026 at 4:36 AM Kiryl Shutsemau <kirill at shutemov.name> wrote:
>
> From: "Kiryl Shutsemau (Meta)" <kas at kernel.org>
>
> A CPU wedged with interrupts masked ignores the stop IPI, and without
> pseudo-NMI there is no NMI IPI to escalate to: a reboot proceeds with
> the CPU still running, and a kdump misses its registers.
>
> Add a third rung to smp_send_stop(): once the IPI (and pseudo-NMI IPI,
> if enabled) rungs have run, signal SDEI event 0 at whatever stayed
> online. Firmware delivers it regardless of the target's DAIF, so it
> reaches a CPU a plain IPI cannot; the target acks by going offline,
> which the caller already polls for.
>
> Fold the stop bookkeeping into one arm64_nmi_cpu_stop(regs,
> die_on_crash), shared by the stop IPI handlers, panic_smp_self_stop()
> and the SDEI handler, replacing the near-duplicate local_cpu_stop() and
> ipi_cpu_crash_stop(). @die_on_crash is the only difference: the IPI
> handlers pass true and PSCI CPU_OFF the CPU on a crash stop so a capture
> kernel can reclaim it; the SDEI handler and self-stop pass false and
> park. The SDEI park is required, not conservative -- its handler runs
> inside an SDEI event that is never completed (completing it resumes the
> wedged context), and a CPU_OFF from that unfinished-event context wedges
> EL3 on some firmware (left as a follow-up). The dump is unaffected; only
> re-onlining the CPU in an SMP capture kernel is lost.
>
> Suggested-by: Douglas Anderson <dianders at chromium.org>
> Signed-off-by: Kiryl Shutsemau (Meta) <kas at kernel.org>
> ---
>  arch/arm64/include/asm/nmi.h    |  24 +++++++
>  arch/arm64/kernel/smp.c         | 109 +++++++++++++++++++++-----------
>  drivers/firmware/Kconfig        |   2 +
>  drivers/firmware/arm_sdei_nmi.c |  75 ++++++++++++++++++++++
>  4 files changed, 172 insertions(+), 38 deletions(-)
>
> diff --git a/arch/arm64/include/asm/nmi.h b/arch/arm64/include/asm/nmi.h
> index 9366be419d18..2e8974ff8d63 100644
> --- a/arch/arm64/include/asm/nmi.h
> +++ b/arch/arm64/include/asm/nmi.h
> @@ -4,21 +4,45 @@
>
>  #include <linux/cpumask.h>
>
> +struct pt_regs;
> +
>  /*
>   * Cross-CPU NMI provider hooks, consulted by the arm64 arch code before
>   * its regular-IRQ / pseudo-NMI IPI paths. The SDEI provider in
>   * drivers/firmware/arm_sdei_nmi.c implements them when active; a future
>   * FEAT_NMI provider could slot in here too. The stubs let callers stay
>   * unconditional when ARM_SDEI_NMI is off.
> + *
> + * sdei_nmi_active() lets a caller test for the service before committing
> + * to (and waiting on) the SDEI stop rung; sdei_nmi_stop_cpus() then signals
> + * the targets, which ack by going offline.
>   */
>  #ifdef CONFIG_ARM_SDEI_NMI
>  bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu);
> +bool sdei_nmi_active(void);
> +void sdei_nmi_stop_cpus(const cpumask_t *mask);
>  #else
>  static inline bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
>                                                       int exclude_cpu)
>  {
>         return false;
>  }
> +
> +static inline bool sdei_nmi_active(void)
> +{
> +       return false;
> +}
> +
> +static inline void sdei_nmi_stop_cpus(const cpumask_t *mask) { }
>  #endif
>
> +/*
> + * The common "stop this CPU" entry every arm64 stop path funnels through:
> + * the regular/pseudo-NMI stop IPI handlers, panic_smp_self_stop(), and the
> + * SDEI cross-CPU NMI handler. @die_on_crash powers the CPU off on the kdump
> + * crash path (IPI handlers) instead of parking it (SDEI / self-stop).
> + * Defined in arch/arm64/kernel/smp.c.
> + */
> +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash);
> +
>  #endif /* __ASM_NMI_H */
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index a670434a8cae..e85a4ba18d5c 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -33,6 +33,7 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/kexec.h>
>  #include <linux/kgdb.h>
> +#include <linux/kprobes.h>
>  #include <linux/kvm_host.h>
>  #include <linux/nmi.h>
>
> @@ -862,14 +863,58 @@ void arch_irq_work_raise(void)
>  }
>  #endif
>
> -static void __noreturn local_cpu_stop(unsigned int cpu)
> +/*
> + * Bring the local CPU to a stop, saving its register state into the vmcore
> + * on the kdump crash path first. The single point every arm64 stop path
> + * funnels through, so the bookkeeping (mask interrupts, mark offline, mask
> + * SDEI, optionally power off) lives in one place:
> + *
> + *   - the regular IPI_CPU_STOP and pseudo-NMI IPI_CPU_STOP_NMI handlers;
> + *   - panic_smp_self_stop(), a CPU parking itself on a parallel panic();
> + *   - the SDEI cross-CPU NMI handler (drivers/firmware/arm_sdei_nmi.c),
> + *     which reaches CPUs the stop IPIs could not.
> + *
> + * @regs is the register state to record in the vmcore on a crash stop; NULL
> + * means "capture the current context". @die_on_crash decides the kdump crash
> + * path: the IPI stop handlers pass true and power the CPU off (PSCI CPU_OFF,
> + * via __cpu_try_die()) so a capture kernel can reclaim it. The SDEI handler
> + * and panic_smp_self_stop() pass false and only park. For SDEI that is
> + * required, not just conservative: it runs inside an SDEI event that is
> + * deliberately never completed (completing it has firmware resume the wedged
> + * context), and a CPU_OFF from that not-yet-completed context wedges EL3 on
> + * some firmware -- a documented follow-up. Parking also matches this path's
> + * own fallback when CPU_OFF is unavailable.
> + */
> +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash)
>  {
> +       unsigned int cpu = smp_processor_id();
> +       bool crash = IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop;
> +
> +       /*
> +        * Use local_daif_mask() instead of local_irq_disable() to make sure
> +        * that pseudo-NMIs are disabled. The "stop" code starts with an IRQ
> +        * and falls back to NMI (which might be pseudo). If the IRQ finally
> +        * goes through right as we're timing out then the NMI could interrupt
> +        * us. It's better to prevent the NMI and let the IRQ finish since the
> +        * pt_regs will be better.
> +        */
> +       local_daif_mask();
> +
> +       if (crash)
> +               crash_save_cpu(regs, cpu);
> +
> +       /* the ack a stop requester (e.g. smp_send_stop()) polls for */
>         set_cpu_online(cpu, false);
>
> -       local_daif_mask();
>         sdei_mask_local_cpu();
> +
> +       if (crash && die_on_crash)
> +               __cpu_try_die(cpu);
> +
> +       /* just in case */
>         cpu_park_loop();
>  }
> +NOKPROBE_SYMBOL(arm64_nmi_cpu_stop);
>
>  /*
>   * We need to implement panic_smp_self_stop() for parallel panic() calls, so
> @@ -878,36 +923,7 @@ static void __noreturn local_cpu_stop(unsigned int cpu)
>   */
>  void __noreturn panic_smp_self_stop(void)
>  {
> -       local_cpu_stop(smp_processor_id());
> -}
> -
> -static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
> -{
> -#ifdef CONFIG_KEXEC_CORE
> -       /*
> -        * Use local_daif_mask() instead of local_irq_disable() to make sure
> -        * that pseudo-NMIs are disabled. The "crash stop" code starts with
> -        * an IRQ and falls back to NMI (which might be pseudo). If the IRQ
> -        * finally goes through right as we're timing out then the NMI could
> -        * interrupt us. It's better to prevent the NMI and let the IRQ
> -        * finish since the pt_regs will be better.
> -        */
> -       local_daif_mask();
> -
> -       crash_save_cpu(regs, cpu);
> -
> -       set_cpu_online(cpu, false);
> -
> -       sdei_mask_local_cpu();
> -
> -       if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
> -               __cpu_try_die(cpu);
> -
> -       /* just in case */
> -       cpu_park_loop();
> -#else
> -       BUG();
> -#endif
> +       arm64_nmi_cpu_stop(NULL, false);
>  }

panic_smp_self_stop() passes regs == NULL. If a second CPU panics
after the primary has already set crash_stop, it loses the panic_cpu
cmpxchg and calls panic_smp_self_stop(); if it was running with
interrupts masked it never took the stop IPI, so it gets here with
crash_stop == 1. crash is then true and we do crash_save_cpu(NULL,
cpu), which ends up in elf_core_copy_regs(), and on arm64 that is just

        *(struct user_pt_regs *)&(dest) = (regs)->user_regs;

so a straight NULL deref -> synchronous abort while we're in the
middle of crashing. The old local_cpu_stop() never called
crash_save_cpu(), so this is a new regression from the unification.

The comment above the function says NULL means "capture the current
context", but crash_save_cpu() doesn't do that, it just dereferences
regs. If that's the intent, materialise it:

        if (crash) {
                struct pt_regs local;

                if (!regs) {
                        crash_setup_regs(&local, NULL);
                        regs = &local;
                }
                crash_save_cpu(regs, cpu);
        }

  crash_setup_regs(..., NULL) is the existing "capture current" helper. Or just
  skip the save when regs is NULL if the self-stop registers aren't
worth having.



More information about the linux-arm-kernel mailing list