[RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2

James Clark james.clark at linaro.org
Fri Jan 9 08:35:46 PST 2026



On 09/01/2026 4:29 pm, James Clark wrote:
> 
> 
> On 14/11/2025 4:07 pm, Alexandru Elisei wrote:
>> If the SPU encounters a translation fault when it attempts to write a
>> profiling record to memory, it stops profiling and asserts the PMBIRQ
>> interrupt.  Interrupts are not delivered instantaneously to the CPU, and
>> this creates a profiling blackout window where the profiled CPU executes
>> instructions, but no samples are collected.
>>
>> This is not desirable, and the SPE driver avoids it by keeping the buffer
>> mapped for the entire profiling session.
>>
>> KVM maps memory at stage 2 when the guest accesses it, following a fault
>> on a missing stage 2 translation, which means the problem is present in
>> an SPE enabled virtual machine. Worse yet, the blackout windows are
>> unpredictable: when profiling the same process, the guest might trigger
>> no stage 2 faults at all during one session (the entire buffer memory is
>> already mapped at stage 2), yet in the worst case trigger a stage 2 fault
>> for every record it attempts to write during another session (if KVM
>> keeps removing the buffer pages from stage 2), or anything in between -
>> some records trigger a stage 2 fault, some don't.
>>
>> The solution is for KVM to follow what the SPE driver does: keep the buffer
>> mapped at stage 2 while ProfilingBufferEnabled() is true. To accomplish
> 
> Hi Alex,
> 
> The problem is that the driver enables and disables the buffer every 
> time the target process is switched out unless you explicitly ask for 
> per-CPU mode. Is there some kind of heuristic you can add to prevent 
> pinning and unpinning unless something actually changes?
> 
> Otherwise it's basically unusable with normal perf commands and larger 
> buffer sizes. Take these basic examples where I've added a filter so no 
> SPE data is even recorded:
> 
>   $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
>       true
> 
> On a kernel with lockdep and kmemleak etc. this takes 20s to complete. On 
> a normal kernel build it still takes 4s.
> 
> Much worse is anything more complicated than just 'true', which will have 
> more context switching:
> 
> $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
>      perf stat true
> 
> This takes 3 minutes or 50 seconds to complete (with and without kernel 
> debugging features, respectively).
> 
> For comparison, these all take less than half a second when run on the 
> host. I measured each pin/unpin at about 0.2s, and the basic 'true' 
> example results in 100 context switches, which adds up to the 20s.
> 
> Another interesting stat is that, according to the second example, 'true' 
> ends up running at an average clock speed of 4 MHz:
> 
>            12683357      cycles   #    0.004 GHz
> 
> You also get warnings like this:
> 
>   rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>   rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-0): P53/1:b..l
>   rcu:     (detected by 0, t=6503 jiffies, g=8461, q=43 ncpus=1)
>   task:perf            state:R  running task     stack:0     pid:53 tgid:53    ppid:52     task_flags:0x400000 flags:0x00000008
>   Call trace:
>    __switch_to+0x1b8/0x2d8 (T)
>    __schedule+0x8b4/0x1050
>    preempt_schedule_common+0x2c/0xb8
>    preempt_schedule+0x30/0x38
>    _raw_spin_unlock+0x60/0x70
>    finish_fault+0x330/0x408
>    do_pte_missing+0x7d4/0x1188
>    handle_mm_fault+0x244/0x568
>    do_page_fault+0x21c/0x548
>    do_translation_fault+0x44/0x68
>    do_mem_abort+0x4c/0x100
>    el0_da+0x58/0x200
>    el0t_64_sync_handler+0xc0/0x130
>    el0t_64_sync+0x198/0x1a0
> 
> If we can't add a heuristic to keep the buffer pinned, it almost seems 
> like the random blackouts would be preferable to pinning being so slow.
> 

One other comment to add to this is that increasing the buffer size is 
the normal reaction to profiling overheads being high. I think that's 
how I came across this in the first place. Or, if you want to avoid the 
overhead entirely, you set a buffer that's large enough to handle the 
whole run.

In this case, in a VM, it actually has the opposite effect of doing the 
same thing on a host: the bigger you make it, the worse the problem gets.
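
To be a bit more concrete about the heuristic I asked about above, I was 
imagining something roughly like the sketch below: cache the currently 
pinned range and skip the expensive unpin/repin when the guest re-enables 
the buffer with the same base and limit. This is only an illustration - 
the struct, fields and helpers here are made up, not existing KVM or SPE 
driver code:

  /* Illustrative sketch only: none of these symbols exist in KVM today. */
  #include <linux/types.h>

  struct spe_pin_state {
          u64 base;       /* base of the last pinned buffer     */
          u64 limit;      /* limit of the last pinned buffer    */
          bool pinned;    /* do we currently hold a pinned map? */
  };

  static void spe_buffer_update(struct spe_pin_state *st, u64 base,
                                u64 limit, bool enabled)
  {
          /* Nothing changed: keep the existing pin and do no work. */
          if (enabled && st->pinned &&
              st->base == base && st->limit == limit)
                  return;

          if (st->pinned) {
                  /* unpin + stage 2 unmap would go here (the ~0.2s cost) */
                  st->pinned = false;
          }

          if (enabled) {
                  /* pin + stage 2 map of [base, limit) would go here */
                  st->base = base;
                  st->limit = limit;
                  st->pinned = true;
          }
  }

Even something that coarse would avoid re-pinning the same 256M buffer a 
hundred times for a run of 'true'.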



