[RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2
James Clark
james.clark at linaro.org
Fri Jan 9 08:35:46 PST 2026
On 09/01/2026 4:29 pm, James Clark wrote:
>
>
> On 14/11/2025 4:07 pm, Alexandru Elisei wrote:
>> If the SPU encounters a translation fault when it attempts to write a
>> profiling record to memory, it stops profiling and asserts the PMBIRQ
>> interrupt. Interrupts are not delivered instantaneously to the CPU, and
>> this creates a profiling blackout window where the profiled CPU executes
>> instructions, but no samples are collected.
>>
>> This is not desirable, and the SPE driver avoids it by keeping the buffer
>> mapped for the entire the profiling session.
>>
>> KVM maps memory at stage 2 when the guest accesses it, following a
>> fault on
>> a missing stage 2 translation, which means that the problem is present
>> in a
>> SPE enabled virtual machine. Worse yet, the blackout windows are
>> unpredictable: the guest profiling the same process can during one
>> profiling session, not trigger any stage 2 faults (the entire buffer
>> memory
>> is already mapped at stage 2), but worst case scenario, during another
>> profiling session, trigger stage 2 faults for every record it attempts to
>> write (if KVM keeps removing the buffer pages from stage 2), or something
>> in between - some records trigger a stage 2 fault, some don't.
>>
>> The solution is for KVM to follow what the SPE driver does: keep the
>> buffer
>> mapped at stage 2 while ProfilingBufferEnabled() is true. To accomplish
>
> Hi Alex,
>
> The problem is that the driver enables and disables the buffer every
> time the target process is switched out unless you explicitly ask for
> per-CPU mode. Is there some kind of heuristic you can add to prevent
> pinning and unpinning unless something actually changes?
>
> Otherwise it's basically unusable with normal perf commands and larger
> buffer sizes. Take these basic examples were I've added a filter so no
> SPE data is even recorded:
>
> $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
> true
>
> On a kernel with lockep and kmemleak etc this takes 20s to complete. On
> a normal kernel build it still takes 4s.
>
> Much worse is anything more complicated than just 'true' which will have
> more context switching:
>
> $ perf record -e arm_spe/min_latency=1000,event_filter=10/ -m,256M --\
> perf stat true
>
> This takes 3 minutes or 50 seconds to complete (with and without kernel
> debugging features respectively)
>
> For comparison, running these on the host all take less than half a
> second. I measured each pin/unpin taking about 0.2s and the basic 'true'
> example resulting in 100 context switches which adds up to the 20s.
>
> Another interesting stat is that the second example says 'true' ends up
> running at an average clock speed of 4Mhz:
>
> 12683357 cycles # 0.004 GHz
>
> You also get warnings like this
>
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-0): P53/1:b..l
> rcu: (detected by 0, t=6503 jiffies, g=8461, q=43 ncpus=1)
> task:perf state:R running task stack:0 pid:53
> tgid:53 ppid:52 task_flags:0x400000 flags:0x00000008
> Call trace:
> __switch_to+0x1b8/0x2d8 (T)
> __schedule+0x8b4/0x1050
> preempt_schedule_common+0x2c/0xb8
> preempt_schedule+0x30/0x38
> _raw_spin_unlock+0x60/0x70
> finish_fault+0x330/0x408
> do_pte_missing+0x7d4/0x1188
> handle_mm_fault+0x244/0x568
> do_page_fault+0x21c/0x548
> do_translation_fault+0x44/0x68
> do_mem_abort+0x4c/0x100
> el0_da+0x58/0x200
> el0t_64_sync_handler+0xc0/0x130
> el0t_64_sync+0x198/0x1a0
>
> If we can't add a heuristic to keep the buffer pinned, it almost seems
> like the random blackouts would be preferable to pinning being so slow.
>
One other comment to add to this, is that increasing the buffer size is
the normal reaction to profiling overheads being high. I think that's
how I came across this in the first place. Or if you want to avoid
overhead entirely you set a buffer that's large enough to handle the
whole run.
In this case in a VM it actually has the opposite effect to doing the
same thing on a host. The bigger you make it the worse the problem gets.
More information about the linux-arm-kernel
mailing list