[RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2

James Clark james.clark at linaro.org
Tue Jan 13 06:18:40 PST 2026



On 12/01/2026 12:01 pm, Alexandru Elisei wrote:
> Hi James,
> 
> On Fri, Jan 09, 2026 at 04:29:33PM +0000, James Clark wrote:
>>
>>
>> On 14/11/2025 4:07 pm, Alexandru Elisei wrote:
>>> If the SPU encounters a translation fault when it attempts to write a
>>> profiling record to memory, it stops profiling and asserts the PMBIRQ
>>> interrupt.  Interrupts are not delivered instantaneously to the CPU, and
>>> this creates a profiling blackout window where the profiled CPU executes
>>> instructions, but no samples are collected.
>>>
>>> This is not desirable, and the SPE driver avoids it by keeping the buffer
>>> mapped for the entire profiling session.
>>>
>>> KVM maps memory at stage 2 when the guest accesses it, following a fault on
>>> a missing stage 2 translation, which means that the problem is present in
>>> an SPE-enabled virtual machine. Worse yet, the blackout windows are
>>> unpredictable: a guest profiling the same process might, during one
>>> profiling session, trigger no stage 2 faults at all (the entire buffer
>>> memory is already mapped at stage 2), yet during another session, in the
>>> worst case, trigger a stage 2 fault for every record it attempts to write
>>> (if KVM keeps removing the buffer pages from stage 2), or anything in
>>> between - some records trigger a stage 2 fault, some don't.
>>>
>>> The solution is for KVM to follow what the SPE driver does: keep the buffer
>>> mapped at stage 2 while ProfilingBufferEnabled() is true. To accomplish
>>
>> Hi Alex,
>>
>> The problem is that the driver enables and disables the buffer every time
>> the target process is switched out unless you explicitly ask for per-CPU
>> mode. Is there some kind of heuristic you can add to prevent pinning and
>> unpinning unless something actually changes?
>>
>> Otherwise it's basically unusable with normal perf commands and larger
>> buffer sizes. Take these basic examples where I've added a filter so no SPE
>> data is even recorded:
>>
>>   $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
>>       true
>>
>> On a kernel with lockdep and kmemleak etc. this takes 20s to complete. On a
>> normal kernel build it still takes 4s.
>>
>> Much worse is anything more complicated than just 'true', which will have
>> more context switching:
>>
>> $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
>>      perf stat true
>>
>> This takes 3 minutes or 50 seconds to complete (with and without kernel
>> debugging features, respectively).
>>
>> For comparison, running these on the host all take less than half a second.
>> I measured each pin/unpin taking about 0.2s and the basic 'true' example
>> resulting in 100 context switches which adds up to the 20s.
>>
>> Another interesting stat is that the second example says 'true' ends up
>> running at an average clock speed of 4 MHz:
>>
>>            12683357      cycles   #    0.004 GHz
>>
>> You also get warnings like this:
>>
>>   rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>   rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-0): P53/1:b..l
>>   rcu: 	(detected by 0, t=6503 jiffies, g=8461, q=43 ncpus=1)
>>   task:perf            state:R  running task     stack:0     pid:53 tgid:53
>> ppid:52     task_flags:0x400000 flags:0x00000008
>>   Call trace:
>>    __switch_to+0x1b8/0x2d8 (T)
>>    __schedule+0x8b4/0x1050
>>    preempt_schedule_common+0x2c/0xb8
>>    preempt_schedule+0x30/0x38
>>    _raw_spin_unlock+0x60/0x70
>>    finish_fault+0x330/0x408
>>    do_pte_missing+0x7d4/0x1188
>>    handle_mm_fault+0x244/0x568
>>    do_page_fault+0x21c/0x548
>>    do_translation_fault+0x44/0x68
>>    do_mem_abort+0x4c/0x100
>>    el0_da+0x58/0x200
>>    el0t_64_sync_handler+0xc0/0x130
>>    el0t_64_sync+0x198/0x1a0
> 
> This is awful, I was able to reproduce it.
> 
>>
>> If we can't add a heuristic to keep the buffer pinned, it almost seems like
>> the random blackouts would be preferable to pinning being so slow.
> 
> I guess I could make it so the memory is kept pinned when the buffer is
> disabled. And then unpin that memory only when the guest enables a buffer that
> doesn't intersect with it. And also have a timer to unpin memory so it doesn't
> stay pinned forever, together with some sort of memory aging mechanism. This is
> getting to be very complex.
> 
> And all of this still requires walking the guest's stage 1
> each time the buffer is enabled, because even though the VAs might be the same,
> the VA->IPA mappings might have changed.
> 
> I'll try to prototype something, see if I can get an improvement.
> 
> Question: if having a large buffer is an issue, couldn't the VMM just restrict
> the buffer size? Or is having a large buffer size that important?

You could restrict the buffer size, but then you'd also be restricting 
users on the same system who are happy to only use per-cpu mode but want 
larger buffers.
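
(By per-cpu mode I mean system-wide sessions, something along the lines of:

  $ perf record -a -e arm_spe/min_latency=1000/ -m,256M -- sleep 10

where the driver keeps the buffer enabled for the whole session rather than
toggling it every time the target task is switched in or out. The command
above is only illustrative, reusing the event options from earlier.)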

I looked for any examples online that don't use the default buffer size 
and didn't come up with anything. Linaro Forge and Arm Streamline also 
support SPE but they seem to be using either the value from 
perf_event_mlock_kb or some small default, so they probably aren't 
problematic.
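
(For what it's worth, that sysctl is easy to check; if I'm remembering the
default right, with 4K pages it works out to 516, i.e. 512KiB of buffer plus
one page for the perf control page:

  $ cat /proc/sys/kernel/perf_event_mlock_kb
  516

so anything sized from it is a long way off the 256M case above.)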

One real use case for large buffers is snapshot mode. Say you are 
interested in a single event, and you want the execution history prior 
to that event. The buffer wraps and overwrites continuously until you 
take the snapshot, at which point the execution history is only limited 
to how big the buffer was. I don't think 256MB or larger is 
unreasonable, but it really depends on the user and the specific 
workload or problem they're working on.
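
With perf's generic AUX snapshot mode that flow looks roughly like this (the
exact options are only illustrative):

  $ perf record -a -e arm_spe/min_latency=1000/ -S -m,256M -- sleep 3600 &
  ... wait for the event of interest ...
  $ kill -USR2 <pid of perf>    # dump the wrapped-around buffer
  $ kill <pid of perf>

and the bigger the AUX buffer, the further back that history reaches.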

If the only real use case for SPE were 4MB buffers then it probably 
wouldn't have been designed the way it was in the first place.

Even with the default 4M it's still very slow. It adds about 1ms to each 
context switch, which has a big knock-on effect:

Host:

$ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,4M -- \
   perf stat true

   Performance counter stats for 'true':

        0      context-switches   #      0.0 cs/sec  cs_per_second
        0      cpu-migrations     #      0.0 migrations/sec
       40      page-faults        # 126103.4 faults/sec
     0.32 msec task-clock         #      0.3 CPUs  CPUs_utilized
     6441      branch-misses      #      4.0 %  branch_miss_rate
   162317      branches           #    511.7 M/sec  branch_frequency
   788795      cpu-cycles         #      2.5 GHz  cycles_frequency
   835001      instructions       #      1.1 instructions  insn_per_cycle

   0.000783226 seconds time elapsed

   0.000921000 seconds user
   0.000000000 seconds sys

Guest:

$ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,4M -- \
   perf stat true

   Performance counter stats for 'true':

   54193400      task-clock               #    0.517 CPUs utilized
         71      context-switches         #    1.310 K/sec
          0      cpu-migrations           #    0.000 /sec
         42      page-faults              #  775.002 /sec
    2652453      instructions             #    0.72  insn per cycle
    3659952      cycles                   #    0.068 GHz
    1860786      stalled-cycles-frontend  #   50.84% frontend cycles idle
     783207      stalled-cycles-backend   #   21.40% backend cycles idle
     518600      branches                 #    9.569 M/sec
      26703      branch-misses            #    5.15% of all branches

   0.104725080 seconds time elapsed

   0.000000000 seconds user
   0.089999000 seconds sys

Guest, but without SPE (just plain 'perf record' instead):

$ perf record -- perf stat true

  Performance counter stats for 'true':

     7311680      task-clock              #    0.534 CPUs utilized
        70      context-switches        #    9.574 K/sec
         0      cpu-migrations          #    0.000 /sec
        41      page-faults             #    5.607 K/sec
    2102657      instructions            #    0.88  insn per cycle
    2398273      cycles                  #    0.328 GHz
   1032732      stalled-cycles-frontend #   43.06% frontend cycles idle
    589058      stalled-cycles-backend  #   24.56% backend cycles idle
    411830      branches                #   56.325 M/sec
     17839      branch-misses           #    4.33% of all branches

   0.013694400 seconds time elapsed

   0.000000000 seconds user
   0.008881000 seconds sys


So, 0.0008s -> 0.01367s -> 0.10472s elapsed time from fastest to slowest
or 2.5 GHz -> 0.328 GHz -> 0.068 GHz
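
Which tallies roughly with the ~1ms per switch figure: 71 context switches x
~1ms is ~70ms, i.e. most of the 0.105s - 0.014s = ~0.09s gap between the two
guest runs.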

Probably worth doing some real benchmarking that's not just 'true' 
though. And trying to understand what scenarios are fair to compare with 
each other. Either way I think SPE is supposed to be lower overhead than 
that.

> 
> Thanks,
> Alex



