[RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2

James Clark james.clark at linaro.org
Fri Jan 9 08:29:33 PST 2026



On 14/11/2025 4:07 pm, Alexandru Elisei wrote:
> If the SPU encounters a translation fault when it attempts to write a
> profiling record to memory, it stops profiling and asserts the PMBIRQ
> interrupt.  Interrupts are not delivered instantaneously to the CPU, and
> this creates a profiling blackout window where the profiled CPU executes
> instructions, but no samples are collected.
> 
> This is not desirable, and the SPE driver avoids it by keeping the buffer
> mapped for the entire profiling session.
> 
> KVM maps memory at stage 2 when the guest accesses it, following a fault on
> a missing stage 2 translation, which means that the problem is present in an
> SPE-enabled virtual machine. Worse yet, the blackout windows are
> unpredictable: during one profiling session, the guest profiling the same
> process might not trigger any stage 2 faults (the entire buffer memory is
> already mapped at stage 2), while during another profiling session it
> might, in the worst case, trigger a stage 2 fault for every record it
> attempts to write (if KVM keeps removing the buffer pages from stage 2),
> or anything in between - some records trigger a stage 2 fault, some don't.
> 
> The solution is for KVM to follow what the SPE driver does: keep the buffer
> mapped at stage 2 while ProfilingBufferEnabled() is true. To accomplish

Hi Alex,

The problem is that the driver enables and disables the buffer every 
time the target process is switched out unless you explicitly ask for 
per-CPU mode. Is there some kind of heuristic you can add to prevent 
pinning and unpinning unless something actually changes?
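
Something along the lines of the sketch below is roughly what I mean,
though I haven't thought it through for the KVM side - all of the names
here (spe_pin_state, pin_buffer_range(), unpin_buffer_range()) are made
up for illustration and aren't from this series. The idea is to cache the
last pinned range, skip the re-pin when the guest re-enables the same
buffer, and defer the real unpin until the range changes or the VM goes
away:

/*
 * Hypothetical sketch only: spe_pin_state, pin_buffer_range() and
 * unpin_buffer_range() are invented names, and the real series no doubt
 * tracks the guest's buffer registers differently. The point is just to
 * remember what is already pinned, skip the re-pin when the guest
 * re-enables the same buffer, and only drop the mapping when the range
 * actually changes.
 */
#include <stdbool.h>
#include <stdint.h>

struct spe_pin_state {
        uint64_t base;   /* start of the guest buffer (PMBPTR at enable time) */
        uint64_t limit;  /* end of the guest buffer (from PMBLIMITR) */
        bool pinned;     /* range is currently pinned and mapped at stage 2 */
};

static void pin_buffer_range(uint64_t base, uint64_t limit)
{
        (void)base; (void)limit;        /* stand-in for the ~0.2s pin */
}

static void unpin_buffer_range(uint64_t base, uint64_t limit)
{
        (void)base; (void)limit;        /* stand-in for the ~0.2s unpin */
}

/* Guest enables the profiling buffer: only pin if something changed. */
static void on_buffer_enable(struct spe_pin_state *s, uint64_t base,
                             uint64_t limit)
{
        if (s->pinned && s->base == base && s->limit == limit)
                return;                 /* same buffer as last time */

        if (s->pinned)
                unpin_buffer_range(s->base, s->limit);

        pin_buffer_range(base, limit);
        s->base = base;
        s->limit = limit;
        s->pinned = true;
}

/* Guest disables the buffer on a context switch: keep the pin around. */
static void on_buffer_disable(struct spe_pin_state *s)
{
        (void)s;
}

/* Unpin for real on VM teardown, buffer reprogramming, memory pressure... */
static void drop_pinned_range(struct spe_pin_state *s)
{
        if (s->pinned)
                unpin_buffer_range(s->base, s->limit);
        s->pinned = false;
}

int main(void)
{
        struct spe_pin_state s = { 0 };
        int i;

        /* 100 context switches toggling the same buffer: one pin in total. */
        for (i = 0; i < 100; i++) {
                on_buffer_enable(&s, 0x80000000ULL,
                                 0x80000000ULL + (256ULL << 20));
                on_buffer_disable(&s);
        }
        drop_pinned_range(&s);
        return 0;
}

Even a coarse version of this would turn the ~100 pin/unpin cycles from
the 'true' example further down into a single pin.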

Otherwise it's basically unusable with normal perf commands and larger
buffer sizes. Take these basic examples where I've added a filter so no
SPE data is even recorded:

  $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
      true

On a kernel with lockdep and kmemleak etc. this takes 20s to complete. On
a normal kernel build it still takes 4s.

Much worse is anything more complicated than just 'true', which will
involve more context switching:

  $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
      perf stat true

This takes 3 minutes or 50 seconds to complete (with and without kernel
debugging features, respectively).

For comparison, running these on the host takes less than half a second
each. I measured each pin/unpin taking about 0.2s, and the basic 'true'
example results in 100 context switches, which adds up to the 20s.

Another interesting stat from the second example is that 'true' ends up
running at an average clock speed of 4 MHz:

           12683357      cycles   #    0.004 GHz

You also get warnings like this:

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-0): P53/1:b..l
  rcu: 	(detected by 0, t=6503 jiffies, g=8461, q=43 ncpus=1)
  task:perf            state:R  running task     stack:0     pid:53 tgid:53    ppid:52     task_flags:0x400000 flags:0x00000008
  Call trace:
   __switch_to+0x1b8/0x2d8 (T)
   __schedule+0x8b4/0x1050
   preempt_schedule_common+0x2c/0xb8
   preempt_schedule+0x30/0x38
   _raw_spin_unlock+0x60/0x70
   finish_fault+0x330/0x408
   do_pte_missing+0x7d4/0x1188
   handle_mm_fault+0x244/0x568
   do_page_fault+0x21c/0x548
   do_translation_fault+0x44/0x68
   do_mem_abort+0x4c/0x100
   el0_da+0x58/0x200
   el0t_64_sync_handler+0xc0/0x130
   el0t_64_sync+0x198/0x1a0

If we can't add a heuristic to keep the buffer pinned, it almost seems 
like the random blackouts would be preferable to pinning being so slow.



