[PATCH v2 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort
Doug Anderson
dianders at chromium.org
Wed Jun 10 15:50:32 PDT 2026
Hi,
On Tue, Jun 9, 2026 at 6:58 AM Kiryl Shutsemau <kirill at shutemov.name> wrote:
>
> @@ -910,6 +911,35 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> #endif
> }
>
> +#ifdef CONFIG_ARM_SDEI_NMI
> +/*
> + * Stop entry for the SDEI cross-CPU NMI service: its event-0 handler
> + * lands here when this CPU was asked to stop. The bookkeeping mirrors
> + * the IPI_CPU_STOP{,_NMI} handling; the park happens inside the SDEI
> + * event, which is never completed -- completing it would have firmware
> + * resume the interrupted (typically wedged) context. No PSCI CPU_OFF
> + * either: powering off a PE that EL3 still considers mid-event invites
> + * firmware trouble.
> + */
> +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs)
> +{
> + unsigned int cpu = smp_processor_id();
> +
> + local_daif_mask();
> +
> + if (IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop)
> + crash_save_cpu(regs, cpu);
> +
> + /* the ack the stop requester polls for */
> + set_cpu_online(cpu, false);
> +
> + sdei_mask_local_cpu();
> +
> + cpu_park_loop();
> +}
> +NOKPROBE_SYMBOL(arm64_nmi_cpu_stop);
> +#endif
Can we combine everything into one function so we don't have to keep
all the different stop functionality in sync? Like this (untested):
void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash)
{
unsigned int cpu = smp_processor_id();
bool crash = IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop;
/*
* Use local_daif_mask() instead of local_irq_disable() to make sure
* that pseudo-NMIs are disabled. The "stop" code starts with
* an IRQ and falls back to NMI (which might be pseudo). If the IRQ
* finally goes through right as we're timing out then the NMI could
* interrupt us. It's better to prevent the NMI and let the IRQ
* finish since the pt_regs will be better.
*/
local_daif_mask();
if (crash)
crash_save_cpu(regs, cpu);
set_cpu_online(cpu, false);
sdei_mask_local_cpu();
if (crash && die_on_crash)
__cpu_try_die(cpu);
/* just in case */
cpu_park_loop();
}
Then in do_handle_IPI(), it's just:
case IPI_CPU_STOP:
case IPI_CPU_STOP_NMI:
arm64_nmi_cpu_stop(get_irq_regs(), true);
break;
...and from the SDEI code you pass "false" for "die_on_crash".
Perhaps when doing that you'd stop unconditionally getting the cpu in
do_handle_IPI() and just call it for `IPI_KGDB_ROUNDUP` since it would
now be the only consumer of that local variable.
FWIW, I'm not totally sure I followed the logic for why "die_on_crash"
needs to be "false" for the SDEI case, but I also am not terribly
familiar with KEXEC and crash-dumping kernels, since I've never
attempted to get that feature to work.
> @@ -1263,6 +1293,29 @@ void smp_send_stop(void)
> udelay(1);
> }
>
> + /*
> + * If CPUs are *still* online, try the SDEI cross-CPU NMI. Firmware
> + * delivers it regardless of the target's DAIF state, so it reaches
> + * a CPU spinning with interrupts masked, which neither rung above
> + * could (without pseudo-NMI there is no NMI rung at all). Allow
> + * 100ms: a firmware round-trip per CPU, with headroom.
> + */
> + if (num_other_online_cpus()) {
> + /* re-snapshot after the rungs above took CPUs offline */
> + smp_rmb();
> + cpumask_copy(&mask, cpu_online_mask);
> + cpumask_clear_cpu(smp_processor_id(), &mask);
> +
> + if (sdei_nmi_stop_cpus(&mask)) {
> + pr_info("SMP: retry stop with SDEI NMI for CPUs %*pbl\n",
> + cpumask_pr_args(&mask));
Perhaps it's being a bit pedantic, but it's a little weird that you're
printing a message that sounds like "I'm going to retry with SDEI"
after you've already done it. It feels like it would be nominally
cleaner (and more parellel with the pseudo-NMI case) if you could
separately test if SDEI is available. Then sdei_nmi_stop_cpus() would
just return void?
> @@ -59,8 +64,45 @@ static bool sdei_nmi_available;
>
> #define SDEI_NMI_EVENT 0
>
> +/*
> + * Stop-request dispatch lives on the same SDEI event 0 as everything
> + * else. The requesting CPU sets each target's bit in sdei_nmi_stop_mask
> + * before signalling event 0; the target's handler test-and-clears its
> + * bit and hands the CPU to arm64_nmi_cpu_stop(), which saves crash
> + * state when the stop is a kdump crash-stop, marks the CPU offline
> + * (which is what the requester polls for) and parks it.
> + *
> + * This mirrors the cpumask the framework's nmi_cpu_backtrace() consults
> + * just below, and a shared mask rather than a separate SDEI event avoids
> + * extra registrations from firmware.
> + */
Do you have any reasoning for why you don't pick a separate EVENT ID
for "backtrace" vs. "stop". If you absolutely have to share an ID
because they're a limited resource then I guess it's fine, but it
would make the code easier to understand / reason about if they were
separate IDs.
If you had a separate EVENT ID, then it seems like you could
completely eliminate the (potentially large) `sdei_nmi_stop_mask`
variable, right? Any time a "STOP" event fires you can unconditionally
consider it to be a stop w/ no globals needed, right?
> @@ -115,6 +157,35 @@ bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
> return true;
> }
>
> +/*
> + * Last rung of the stop escalation in smp_send_stop() (see
> + * arch/arm64/kernel/smp.c). The caller runs the regular stop IPI (and
> + * the pseudo-NMI stop IPI, where available) first; @mask holds whatever
> + * stayed online through those -- typically CPUs wedged with interrupts
> + * masked, unreachable by an IPI. Set each target's stop-request flag and
> + * signal event 0 at it; a target acks by marking itself offline, which
> + * the caller polls for.
> + *
> + * Returns false when SDEI isn't active, so the caller can skip the wait.
> + */
> +bool sdei_nmi_stop_cpus(const cpumask_t *mask)
> +{
> + unsigned int cpu;
> +
> + if (!sdei_nmi_available)
> + return false;
> +
> + cpumask_or(&sdei_nmi_stop_mask, &sdei_nmi_stop_mask, mask);
As per above, hopefully we can get rid of `sdei_nmi_stop_mask`. ...but
if we keep it, I'm curious why "or" and not "copy"?
More information about the kexec
mailing list