[PATCH 4/4] arm64: route crash_smp_send_stop() last resort through SDEI

Fri Jun 5 13:42:57 PDT 2026

Hi,

On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau <kirill at shutemov.name> wrote:
>
> @@ -1288,8 +1288,32 @@ void crash_smp_send_stop(void)
>                 return;
>         crash_stop = 1;
>
> +       /*
> +        * Stop the normal way first: IPI_CPU_STOP escalating to a pseudo-NMI
> +        * IPI. Every CPU that responds saves its state via crash_save_cpu()
> +        * and parks in cpu_park_loop() with its online bit cleared -- the
> +        * standard kdump stop, identical to a kernel without SDEI. Crucially
> +        * those CPUs stay in a clean, potentially-reusable state.
> +        */
>         smp_send_stop();
>
> +       /*
> +        * Whatever is still online didn't respond -- typically a CPU wedged
> +        * with interrupts masked. The plain IPI can't reach it, and a fleet
> +        * that declines the pseudo-NMI hot-path cost has no NMI IPI to
> +        * escalate to. Hit only the survivors with the SDEI cross-CPU NMI
> +        * (no-op if SDEI isn't active, or if everything already stopped):
> +        * firmware delivers out of EL3 regardless of PSTATE.DAIF, and the
> +        * handler captures crash_save_cpu() state from the wedged context
> +        * before parking the CPU.
> +        *
> +        * SDEI is deliberately last: an SDEI-stopped CPU never completes its
> +        * event (it parks inside the handler, so EL3 retains its dispatch
> +        * slot until reset), which is strictly less recoverable than a normal
> +        * stop. We pay that only for CPUs that left no other way to reach them.
> +        */
> +       sdei_nmi_crash_smp_send_stop();

It seems odd to me that you're adding SDEI for "crash stop" but not
for the regular "stop". I'd expect smp_send_stop() itself to fall back
to SDEI when the NMI fails to stop a CPU, rather than having
crash_smp_send_stop() grow this separate path.
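
To make the suggested ordering concrete, here is a rough userspace model of
a single stop routine that escalates through the available mechanisms and
only targets CPUs that are still online. It is not kernel code: every name
in it (model_send_stop(), struct model_system, the reachability rules) is
made up for illustration, and the rules for which method reaches a wedged
CPU are deliberately simplified.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CPUS 8

/* Stop mechanisms in escalation order; SDEI comes last. */
enum stop_method { STOP_IPI, STOP_PSEUDO_NMI, STOP_SDEI, STOP_METHOD_COUNT };

struct model_cpu {
	bool online;
	bool irqs_masked;	/* wedged with interrupts masked */
};

struct model_system {
	struct model_cpu cpus[MODEL_MAX_CPUS];
	size_t nr_cpus;
	bool have_pseudo_nmi;
	bool have_sdei;
};

/* Simplified reachability rules, assumed for this model only. */
static bool method_reaches(const struct model_system *sys,
			   const struct model_cpu *cpu, enum stop_method m)
{
	switch (m) {
	case STOP_IPI:
		return !cpu->irqs_masked;
	case STOP_PSEUDO_NMI:
		return sys->have_pseudo_nmi;
	case STOP_SDEI:
		return sys->have_sdei;
	default:
		return false;
	}
}

/*
 * Try each method in order, targeting only CPUs that are still online.
 * Returns how many CPUs other than 'self' remain online at the end.
 */
size_t model_send_stop(struct model_system *sys, size_t self)
{
	size_t remaining = 0;

	for (int m = STOP_IPI; m < STOP_METHOD_COUNT; m++) {
		remaining = 0;
		for (size_t i = 0; i < sys->nr_cpus; i++) {
			struct model_cpu *cpu = &sys->cpus[i];

			if (i == self || !cpu->online)
				continue;
			if (method_reaches(sys, cpu, (enum stop_method)m))
				cpu->online = false;
			else
				remaining++;
		}
		if (remaining == 0)
			break;
	}
	return remaining;
}
```

In this model a CPU wedged with interrupts masked is only taken offline by
the last step, and only if SDEI is available; with neither pseudo-NMI nor
SDEI it is reported back as still online, for both the regular and the
crash caller.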

>  static int sdei_nmi_handler(u32 event, struct pt_regs *regs, void *arg)
>  {
> +       int cpu = smp_processor_id();
> +
> +       if (READ_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_requested))) {
> +               WRITE_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_requested), 0);
> +
> +               /*
> +                * Capture the wedged context for kdump while pt_regs still
> +                * points at the interrupted PC. This is the main motivation
> +                * for using SDEI here: the plain IPI stop path can't reach an
> +                * interrupt-masked CPU (and the fleet declines pseudo-NMI to
> +                * keep the IRQ-mask hot path cheap), so crash_save_cpu() for
> +                * that CPU would otherwise record nothing useful.
> +                */
> +               crash_save_cpu(regs, cpu);
> +               set_cpu_online(cpu, false);
> +
> +               /* publish the crash state/offline before the requester sees the ack */
> +               smp_wmb();
> +               WRITE_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_acked), 1);
> +
> +               /*
> +                * Park forever from within the SDEI handler. We deliberately
> +                * do NOT issue SDEI_EVENT_COMPLETE: the framework's return
> +                * path restores firmware's saved interrupted context, which
> +                * would land the CPU back wherever it was running (often
> +                * do_idle, which then notices cpu_is_offline=true and BUGs
> +                * at cpuhp_report_idle_dead). Returning the modified pt_regs
> +                * doesn't help -- arch/arm64/kernel/sdei.c::do_sdei_event
> +                * only honours a PC override via its IRQ-state heuristic
> +                * and otherwise hands EL3 its own saved-context slot back.
> +                *
> +                * Trade-off: EL3 firmware retains ~one saved-context slot
> +                * per parked CPU until the next hardware reset (~hundreds of
> +                * bytes per CPU). The CPU itself is parked in cpu_park_loop
> +                * exactly as if IPI_CPU_STOP had stopped it; recoverability
> +                * is unchanged versus the existing path (neither is
> +                * recoverable without hardware reset, since PSCI sees the
> +                * CPU as ALREADY_ON in both cases).
> +                */
> +               cpu_park_loop();
> +               /* unreachable */

Any chance we could avoid duplicating the crash_save_cpu() /
set_cpu_online() / cpu_park_loop() sequence from ipi_cpu_crash_stop()?
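
One possible shape for that, sketched as a userspace model rather than real
kernel code: factor the common "save state and go offline" steps into one
helper and let each path add only what is specific to it before parking.
All names here are hypothetical stand-ins (the comments note which kernel
call each one represents), and the park stub returns so the model can be
exercised, whereas cpu_park_loop() never does.

```c
#include <stdbool.h>

#define MODEL_NR_CPUS 4

struct model_regs { unsigned long pc; };

struct model_cpu_state {
	bool online;
	bool state_saved;
	unsigned long saved_pc;
	bool sdei_acked;
	bool parked;
};

struct model_cpu_state model_cpus[MODEL_NR_CPUS];

/* Hypothetical shared helper: the steps both stop paths have in common. */
static void model_crash_save_and_offline(const struct model_regs *regs, int cpu)
{
	model_cpus[cpu].saved_pc = regs->pc;	/* stands in for crash_save_cpu() */
	model_cpus[cpu].state_saved = true;
	model_cpus[cpu].online = false;		/* stands in for set_cpu_online(cpu, false) */
}

/* Stands in for cpu_park_loop(); the real one never returns. */
static void model_park(int cpu)
{
	model_cpus[cpu].parked = true;
}

/* IPI path: nothing extra between going offline and parking. */
void model_ipi_crash_stop(const struct model_regs *regs, int cpu)
{
	model_crash_save_and_offline(regs, cpu);
	model_park(cpu);
}

/* SDEI path: same common steps, plus the ack the requester waits for. */
void model_sdei_crash_stop(const struct model_regs *regs, int cpu)
{
	model_crash_save_and_offline(regs, cpu);
	model_cpus[cpu].sdei_acked = true;	/* SDEI-only step */
	model_park(cpu);
}
```

The ack sits between the shared helper and the park step, mirroring where
the quoted handler publishes it, which is why the helper stops short of
parking.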

> +bool sdei_nmi_crash_smp_send_stop(void)
> +{
> +       unsigned int this_cpu, cpu, remaining;
> +       unsigned long timeout;
> +       cpumask_t mask;

The above will probably get you a yell. Putting "cpumask_t" on the
stack is a no-no: it holds one bit per possible CPU, so it can be quite
large under certain CONFIG options. This is why it's nearly always
defined as "static".
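
For a sense of scale, the arithmetic can be modelled in plain C. This is
only a back-of-the-envelope sketch: cpumask_bytes_model() is a made-up
helper that rounds a CPU count up to whole unsigned longs, and the CPU
counts used with it are example values, not a claim about any particular
config.

```c
#include <limits.h>
#include <stddef.h>

/* Bits in one unsigned long on the machine this model is built for. */
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Bytes needed for a bitmap with one bit per CPU, rounded up to a whole
 * number of unsigned longs. 'nr_cpus' plays the role of NR_CPUS here.
 */
size_t cpumask_bytes_model(size_t nr_cpus)
{
	size_t longs = (nr_cpus + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG;

	return longs * sizeof(unsigned long);
}
```

With 4096 possible CPUs this comes to 512 bytes, and with 8192 to 1024
bytes, all of which would land on the stack of a function that is already
running in a crash path.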

-Doug