[RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2
James Clark
james.clark at linaro.org
Tue Jan 13 06:18:40 PST 2026
On 12/01/2026 12:01 pm, Alexandru Elisei wrote:
> Hi James,
>
> On Fri, Jan 09, 2026 at 04:29:33PM +0000, James Clark wrote:
>>
>>
>> On 14/11/2025 4:07 pm, Alexandru Elisei wrote:
>>> If the SPU encounters a translation fault when it attempts to write a
>>> profiling record to memory, it stops profiling and asserts the PMBIRQ
>>> interrupt. Interrupts are not delivered instantaneously to the CPU, and
>>> this creates a profiling blackout window where the profiled CPU executes
>>> instructions, but no samples are collected.
>>>
>>> This is not desirable, and the SPE driver avoids it by keeping the buffer
>>> mapped for the entire profiling session.
>>>
>>> KVM maps memory at stage 2 when the guest accesses it, following a fault on
>>> a missing stage 2 translation, which means that the problem is present in
>>> an SPE enabled virtual machine. Worse yet, the blackout windows are
>>> unpredictable: a guest profiling the same process might trigger no stage 2
>>> faults during one profiling session (the entire buffer memory is already
>>> mapped at stage 2), in the worst case trigger a stage 2 fault for every
>>> record it attempts to write during another session (if KVM keeps removing
>>> the buffer pages from stage 2), or land anywhere in between, with some
>>> records faulting and some not.
>>>
>>> The solution is for KVM to follow what the SPE driver does: keep the buffer
>>> mapped at stage 2 while ProfilingBufferEnabled() is true. To accomplish
>>> [...]
>>
>> Hi Alex,
>>
>> The problem is that the driver enables and disables the buffer every time
>> the target process is switched in or out, unless you explicitly ask for
>> per-CPU mode. Is there some kind of heuristic you can add to prevent
>> pinning and unpinning unless something actually changes?
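>>
>> For context, on every switch-out the driver's ->stop() callback ends up
>> in a path that does roughly the following (paraphrased from memory from
>> drivers/perf/arm_spe_pmu.c, not verbatim). With pinning tied to
>> ProfilingBufferEnabled(), every switch-out therefore becomes an unpin
>> and every switch-in a pin:
>>
>> static void arm_spe_pmu_disable_and_drain_local(void)
>> {
>> 	/* Stop profiling */
>> 	write_sysreg_s(0, SYS_PMSCR_EL1);
>> 	isb();
>>
>> 	/* Drain any data still buffered in the SPU */
>> 	psb_csync();
>> 	dsb(nsh);
>>
>> 	/* Disable the profiling buffer: PMBLIMITR_EL1.E = 0, i.e. the
>> 	 * ProfilingBufferEnabled() == false condition that this series
>> 	 * keys the unpin on */
>> 	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
>> 	isb();
>> }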
>>
>> Otherwise it's basically unusable with normal perf commands and larger
>> buffer sizes. Take these basic examples, where I've added a filter so no
>> SPE data is even recorded:
>>
>> $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M -- \
>>     true
>>
>> On a kernel with lockdep and kmemleak etc. this takes 20s to complete. On
>> a normal kernel build it still takes 4s.
>>
>> Much worse is anything more complicated than just 'true', which will have
>> more context switching:
>>
>> $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M -- \
>>     perf stat true
>>
>> This takes 3 minutes or 50 seconds to complete (with and without kernel
>> debugging features, respectively).
>>
>> For comparison, running these on the host takes less than half a second
>> in all cases. I measured each pin/unpin taking about 0.2s, and the basic
>> 'true' example results in about 100 context switches, which adds up to
>> the 20s.
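>>
>> Timing each pin is simple enough, e.g. wrapping the pin path with
>> something like this (kvm_spe_pin_buffer() is a made-up stand-in for
>> wherever the pinning actually happens):
>>
>> 	ktime_t t0 = ktime_get();
>>
>> 	kvm_spe_pin_buffer(vcpu);	/* made-up name for the pin path */
>> 	pr_info("SPE pin took %lld us\n", ktime_us_delta(ktime_get(), t0));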
>>
>> Another interesting stat is that the second example shows 'true' ends up
>> running at an average clock speed of 4 MHz:
>>
>>     12683357      cycles                    #    0.004 GHz
>>
>> You also get warnings like this:
>>
>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-0): P53/1:b..l
>> rcu:     (detected by 0, t=6503 jiffies, g=8461, q=43 ncpus=1)
>> task:perf  state:R running task  stack:0 pid:53 tgid:53
>> ppid:52 task_flags:0x400000 flags:0x00000008
>> Call trace:
>>  __switch_to+0x1b8/0x2d8 (T)
>>  __schedule+0x8b4/0x1050
>>  preempt_schedule_common+0x2c/0xb8
>>  preempt_schedule+0x30/0x38
>>  _raw_spin_unlock+0x60/0x70
>>  finish_fault+0x330/0x408
>>  do_pte_missing+0x7d4/0x1188
>>  handle_mm_fault+0x244/0x568
>>  do_page_fault+0x21c/0x548
>>  do_translation_fault+0x44/0x68
>>  do_mem_abort+0x4c/0x100
>>  el0_da+0x58/0x200
>>  el0t_64_sync_handler+0xc0/0x130
>>  el0t_64_sync+0x198/0x1a0
>
> This is awful, I was able to reproduce it.
>
>>
>> If we can't add a heuristic to keep the buffer pinned, it almost seems like
>> the random blackouts would be preferable to pinning being so slow.
>
> I guess I could make it so the memory is kept pinned when the buffer is
> disabled. And then unpin that memory only when the guest enables a buffer that
> doesn't intersect with it. And also have a timer to unpin memory so it doesn't
> stay pinned forever, together with some sort of memory aging mechanism. This is
> getting to be very complex.
>
> And all of this still requires walking the guest's stage 1
> each time the buffer is enabled, because even though the VAs might be the same,
> the VA->IPA mappings might have changed.
>
> I'll try to prototype something, see if I can get an improvement.
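
Maybe something like this as a starting point? A very rough sketch, every
name in it is invented, a real version would need to track a set of IPA
ranges rather than a single one, and locking plus the timer handler that
does the actual unpin are omitted:

struct spe_pin_cache {
	u64 ipa_base;			/* currently pinned range */
	u64 size;
	bool pinned;
	struct timer_list expiry;	/* unpins after a period of disuse */
};

static void spe_buffer_enabled(struct spe_pin_cache *c, u64 va_base,
			       u64 size)
{
	/* The stage 1 walk can't be skipped: the VA->IPA mapping may
	 * have changed since the buffer was last enabled. */
	u64 ipa = spe_walk_stage1(va_base);

	/* Fast path: the new buffer is covered by what's already
	 * pinned, so skip the expensive unpin/pin cycle. */
	if (c->pinned && ipa >= c->ipa_base &&
	    ipa + size <= c->ipa_base + c->size) {
		timer_delete(&c->expiry);
		return;
	}

	if (c->pinned)
		spe_unpin_range(c->ipa_base, c->size);
	spe_pin_range(ipa, size);

	c->ipa_base = ipa;
	c->size = size;
	c->pinned = true;
	timer_delete(&c->expiry);	/* stay pinned while enabled */
}

static void spe_buffer_disabled(struct spe_pin_cache *c)
{
	/* Keep the memory pinned, but don't let it stay pinned forever
	 * if the guest never re-enables the buffer. */
	mod_timer(&c->expiry, jiffies + SPE_PIN_TIMEOUT);
}
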
>
> Question: if having a large buffer is an issue, couldn't the VMM just restrict
> the buffer size? Or is having a large buffer size that important?

You could restrict the buffer size, but then you'd also be restricting
users on the same system who are happy to only use per-CPU mode but want
larger buffers.

I looked online for any examples that don't use the default buffer size
and didn't come up with anything. Linaro Forge and Arm Streamline also
support SPE, but they seem to use either the value from
perf_event_mlock_kb or some small default, so they probably aren't
problematic.

One real use case for large buffers is snapshot mode. Say you are
interested in a single event and you want the execution history prior to
that event. The buffer wraps and overwrites continuously until you take
the snapshot, at which point the execution history is limited only by how
big the buffer was. I don't think 256MB or larger is unreasonable, but it
really depends on the user and the specific workload or problem they're
working on.

If the only real use case for SPE were 4MB buffers then it probably
wouldn't have been designed the way it was in the first place.

Even with the default 4M it's still very slow. It adds about 1ms to the
context switch, which has a big knock-on effect:

Host:

$ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,4M -- \
    perf stat true

 Performance counter stats for 'true':

              0      context-switches          #      0.0 cs/sec  cs_per_second
              0      cpu-migrations            #      0.0 migrations/sec
             40      page-faults               # 126103.4 faults/sec
           0.32 msec task-clock                #      0.3 CPUs    CPUs_utilized
           6441      branch-misses             #      4.0 %       branch_miss_rate
         162317      branches                  #    511.7 M/sec   branch_frequency
         788795      cpu-cycles                #      2.5 GHz     cycles_frequency
         835001      instructions              #      1.1 instructions  insn_per_cycle

    0.000783226 seconds time elapsed

    0.000921000 seconds user
    0.000000000 seconds sys

Guest:

$ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,4M -- \
    perf stat true

 Performance counter stats for 'true':

       54193400      task-clock                #    0.517 CPUs utilized
             71      context-switches          #    1.310 K/sec
              0      cpu-migrations            #    0.000 /sec
             42      page-faults               #  775.002 /sec
        2652453      instructions              #    0.72  insn per cycle
        3659952      cycles                    #    0.068 GHz
        1860786      stalled-cycles-frontend   #   50.84% frontend cycles idle
         783207      stalled-cycles-backend    #   21.40% backend cycles idle
         518600      branches                  #    9.569 M/sec
          26703      branch-misses             #    5.15% of all branches

    0.104725080 seconds time elapsed

    0.000000000 seconds user
    0.089999000 seconds sys

Guest, but without SPE (just plain 'perf record' instead):

$ perf record -- perf stat true

 Performance counter stats for 'true':

        7311680      task-clock                #    0.534 CPUs utilized
             70      context-switches          #    9.574 K/sec
              0      cpu-migrations            #    0.000 /sec
             41      page-faults               #    5.607 K/sec
        2102657      instructions              #    0.88  insn per cycle
        2398273      cycles                    #    0.328 GHz
        1032732      stalled-cycles-frontend   #   43.06% frontend cycles idle
         589058      stalled-cycles-backend    #   24.56% backend cycles idle
         411830      branches                  #   56.325 M/sec
          17839      branch-misses             #    4.33% of all branches

    0.013694400 seconds time elapsed

    0.000000000 seconds user
    0.008881000 seconds sys

So the elapsed times go 0.0008s -> 0.0137s -> 0.1047s from fastest to
slowest, or 2.5 GHz -> 0.328 GHz -> 0.068 GHz.

Probably worth doing some real benchmarking with something that's not
just 'true', though, and trying to understand which scenarios are fair to
compare with each other. Either way, I think SPE is supposed to be lower
overhead than that.

>
> Thanks,
> Alex