[PATCH v4 0/4] arm64: cross-CPU NMI via SDEI

Mon Jun 22 06:56:16 PDT 2026

On Fri, Jun 19, 2026 at 03:26:21PM +0100, Marc Zyngier wrote:
> > Does your firmware set ICC_CTLR_EL1.PMHE? I'd be curious to see the
> > numbers if the DSB was omitted on the enable path.
>
> I certainly don't observe this sort of overhead on the HW I have
> access to, and would like to understand where this is coming from with
> actual profiling data.

Full disclosure: the ~66% figures come from internal testing about a year ago.
I no longer have the details of the machine it ran on and can't confirm whether
ICC_CTLR_EL1.PMHE was set there -- it may well have been. I shouldn't have
carried those numbers forward without being able to stand behind them, so
please disregard them.

Here are fresh numbers from NVIDIA Grace (Neoverse V2). Importantly, this
box reports:

  GICv3: Pseudo-NMIs enabled using relaxed ICC_PMR_EL1 synchronisation

i.e. PMHE == 0, so the synchronising DSB on the unmask path is already
patched to a NOP (ARM64_HAS_GIC_PRIO_RELAXED_SYNC). What's left is the
floor cost of PMR-based masking itself plus the PMR save/restore on
exception entry/exit -- not the DSB. So this is the case Catalin asked
about (DSB omitted), and there is still a measurable cost.

A trivial single-threaded gettid() loop (1e6 calls, median of 5,
performance governor, ASLR off):

  pseudo_nmi=0 (DAIF):       178.4 ns/call
  pseudo_nmi=1 (PMR):        252.5 ns/call
  delta:                     +74.1 ns/call  (~230-250 cycles)
                             +41.5% wall time / 0.706x throughput

  --- u-bench.c ---
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <time.h>
  #include <stdio.h>
  int main(void) {
          struct timespec a, b;
          clock_gettime(CLOCK_MONOTONIC, &a);
          for (long i = 0; i < 1000000; i++)
                  syscall(SYS_gettid);
          clock_gettime(CLOCK_MONOTONIC, &b);
          /* report per-call cost (1e6 calls) to match the table above */
          printf("%.1f ns/call\n", ((b.tv_sec-a.tv_sec)*1e9 +
                 (b.tv_nsec-a.tv_nsec)) / 1e6);
          return 0;
  }

will-it-scale agrees independently. sched_yield (ops/s, median of 5):

                      1 task       72 tasks
  pseudo_nmi=0      3,195,656    230,824,534
  pseudo_nmi=1      2,253,753    163,914,837
  ratio                0.705          0.710

The ratio is flat across the whole 1-to-72 sweep, so -- relevant to the
scalability question -- it's a constant per-syscall tax, not a contention
effect. The impact tracks syscall/exception density: page_fault1, a more
realistic workload, differs by no more than ~5% between pseudo_nmi=0 and
pseudo_nmi=1.

> The direction of travel is to deprecate SDEI. I wouldn't add more stuff
> on top of this interface.

I understand FEAT_NMI is the long-term answer, and I'm not arguing against
deprecating SDEI. My concern is the gap in between. By our estimate it's
10+ years before the last non-FEAT_NMI machine retires from the fleet --
for scale, we're still running Skylake today. So there's roughly a
decade where a large installed base has neither FEAT_NMI nor affordable
pseudo-NMI, and no way to reach a DAIF-masked CPU for an all-CPU
backtrace or to capture a wedged CPU in a crash dump. That's the
functional gap this series tries to cover.

Given the deprecation direction, I deliberately kept the SDEI footprint as
small as I could. The series adds no new firmware interface and no vendor
SMC -- it uses only the standard software-signalled event (event 0) via
SDEI_EVENT_SIGNAL, which is already present on these systems for
firmware-first RAS (APEI/GHES). And SDEI is only ever invoked in a "bad
state": to deliver a backtrace signal to a CPU that a normal IPI can't
reach, or to stop a CPU that ignored the stop IPIs. Nothing on any hot or
steady-state path touches it.

If even that minimal use is unacceptable on a deprecated interface, I'd
rather know now and redirect the effort -- but I'd appreciate a pointer to
what should cover this gap for existing silicon in the meantime.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov