[PATCH v4 0/4] arm64: cross-CPU NMI via SDEI
Kiryl Shutsemau
kirill at shutemov.name
Mon Jun 22 06:56:16 PDT 2026
On Fri, Jun 19, 2026 at 03:26:21PM +0100, Marc Zyngier wrote:
> > Does your firmware set ICC_CTLR_EL1.PMHE? I'd be curious to see the
> > numbers if the DSB was omitted on the enable path.
>
> I certainly don't observe this sort of overhead on the HW I have
> access to, and would like to understand where this is coming from with
> actual profiling data.
Full disclosure: the ~66% figures come from internal testing about a year ago.
I no longer have the details of the machine it ran on and can't confirm whether
ICC_CTLR_EL1.PMHE was set there -- it may well have been. I shouldn't have
carried those numbers forward without being able to stand behind them, so
please disregard them.
Here are fresh numbers from NVIDIA Grace (Neoverse V2). Importantly, this
box reports:
    GICv3: Pseudo-NMIs enabled using relaxed ICC_PMR_EL1 synchronisation
i.e. PMHE == 0, so the synchronising DSB on the unmask path is already
patched to a NOP (ARM64_HAS_GIC_PRIO_RELAXED_SYNC). What's left is the
floor cost of PMR-based masking itself plus the PMR save/restore on
exception entry/exit -- not the DSB. So this is the case Catalin asked
about (DSB omitted), and there is still a measurable cost.
A trivial single-threaded gettid() loop (1e6 calls, median of 5,
performance governor, ASLR off):
    pseudo_nmi=0 (DAIF):  178.4 ns/call
    pseudo_nmi=1 (PMR):   252.5 ns/call
    delta:                +74.1 ns/call (~230-250 cycles)
                          +41.5% wall time / 0.706 throughput
--- u-bench.c ---
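#define _GNU_SOURCE		/* for syscall() */
#include <unistd.h>
#include <sys/syscall.h>
#include <time.h>
#include <stdio.h>

#define ITERS 1000000L

int main(void)
{
	struct timespec a, b;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (long i = 0; i < ITERS; i++)
		syscall(SYS_gettid);	/* one real syscall per iteration */
	clock_gettime(CLOCK_MONOTONIC, &b);

	/* total elapsed time in ns, averaged over all calls */
	ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
	printf("%.1f ns/call\n", ns / ITERS);
	return 0;
}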
will-it-scale agrees independently. sched_yield (ops/s, median of 5):
                     1 task         72 tasks
    pseudo_nmi=0     3,195,656      230,824,534
    pseudo_nmi=1     2,253,753      163,914,837
    ratio            0.705          0.710
The ratio is flat across the whole 1-to-72 sweep, so -- relevant to the
scalability question -- it's a constant per-syscall tax, not a contention
effect. The impact tracks syscall/exception density: page_fault1, a more
realistic workload, stays within ~5% of the pseudo_nmi=0 result.
> The direction of travel is to deprecate SDEI. I wouldn't add more stuff
> on top of this interface.
I understand FEAT_NMI is the long-term answer, and I'm not arguing against
deprecating SDEI. My concern is the gap in between. By our estimate it's
10+ years before the last non-FEAT_NMI machine retires from the fleet --
for scale, we're still running Skylake today. So there's roughly a
decade where a large installed base has neither FEAT_NMI nor affordable
pseudo-NMI, and no way to reach a DAIF-masked CPU for an all-CPU
backtrace or to capture a wedged CPU in a crash dump. That's the
functional gap this series tries to cover.
Given the deprecation direction, I deliberately kept the SDEI footprint as
small as I could. The series adds no new firmware interface and no vendor
SMC -- it uses only the standard software-signalled event (event 0) via
SDEI_EVENT_SIGNAL, which is already present on these systems for
firmware-first RAS (APEI/GHES). And SDEI is only ever invoked in a "bad
state": to deliver a backtrace signal to a CPU that a normal IPI can't
reach, or to stop a CPU that ignored the stop IPIs. Nothing on any hot or
steady-state path touches it.
If even that minimal use is unacceptable on a deprecated interface, I'd
rather know now and redirect the effort -- but I'd appreciate a pointer to
what should cover this gap for existing silicon in the meantime.
--
Kiryl Shutsemau / Kirill A. Shutemov