[PATCH] KVM: arm64: Disable TRBE Trace Buffer Unit when running in guest context

James Clark james.clark at linaro.org
Fri Feb 20 03:42:11 PST 2026



On 16/02/2026 4:49 pm, Marc Zyngier wrote:
> On Mon, 16 Feb 2026 16:10:14 +0000,
> James Clark <james.clark at linaro.org> wrote:
>>
>>
>>
>> On 16/02/2026 3:51 pm, Marc Zyngier wrote:
>>> On Mon, 16 Feb 2026 15:05:10 +0000,
>>> James Clark <james.clark at linaro.org> wrote:
>>>>
>>>>
>>>>
>>>> On 16/02/2026 2:29 pm, Marc Zyngier wrote:
>>>>> On Mon, 16 Feb 2026 13:09:59 +0000,
>>>>> Will Deacon <will at kernel.org> wrote:
>>>>>>
>>>>>> The nVHE world-switch code relies on zeroing TRFCR_EL1 to disable trace
>>>>>> generation in guest context when self-hosted TRBE is in use by the host.
>>>>>>
>>>>>> Per D3.2.1 ("Controls to prohibit trace at Exception levels"), clearing
>>>>>> TRFCR_EL1 means that trace generation is prohibited at EL1 and EL0 but
>>>>>> per R_YCHKJ the Trace Buffer Unit will still be enabled if
>>>>>> TRBLIMITR_EL1.E is set. R_SJFRQ goes on to state that, when enabled, the
>>>>>> Trace Buffer Unit can perform address translation for the "owning
>>>>>> exception level" even when it is out of context.
>>>>>
>>>>> Great. So TRBE violates all the principles that we hold true in the
>>>>> architecture. Does SPE suffer from the same level of brokenness?
>>>>>
>>>>>> Consequently, we can end up in a state where TRBE performs speculative
>>>>>> page-table walks for a host VA/IPA in guest/hypervisor context depending
>>>>>> on the value of MDCR_EL2.E2TB, which changes over world-switch. The
>>>>>> result appears to be a heady mixture of data corruption and hardware
>>>>>> lockups.
>>>>>>
>>>>>> Extend the TRBE world-switch code to clear TRBLIMITR_EL1.E after
>>>>>> draining the buffer, restoring the register on return to the host.
>>>>>>
>>>>>> Cc: Marc Zyngier <maz at kernel.org>
>>>>>> Cc: Oliver Upton <oupton at kernel.org>
>>>>>> Cc: James Clark <james.clark at linaro.org>
>>>>>> Cc: Leo Yan <leo.yan at arm.com>
>>>>>> Cc: Suzuki K Poulose <suzuki.poulose at arm.com>
>>>>>> Cc: Fuad Tabba <tabba at google.com>
>>>>>> Fixes: a1319260bf62 ("arm64: KVM: Enable access to TRBE support for host")
>>>>>> Signed-off-by: Will Deacon <will at kernel.org>
>>>>>> ---
>>>>>>
>>>>>> NOTE: This is *untested* as I don't have a TRBE-capable device that can
>>>>>> run upstream but I noticed this by inspection when triaging occasional
>>>>>> hardware lockups on systems using a 6.12-based kernel with TRBE running
>>>>>> at the same time as a vCPU is loaded. This code has changed quite a bit
>>>>>> over time, so stable backports are not entirely straightforward.
>>>>>> Hopefully James/Leo/Suzuki can help us test if folks agree with the
>>>>>> general approach taken here.
>>>>>>
>>>>>>     arch/arm64/include/asm/kvm_host.h  |  1 +
>>>>>>     arch/arm64/kvm/hyp/nvhe/debug-sr.c | 36 ++++++++++++++++++++++--------
>>>>>>     2 files changed, 28 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>>>> index ac7f970c7883..a932cf043b83 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>>>> @@ -746,6 +746,7 @@ struct kvm_host_data {
>>>>>>     		u64 pmscr_el1;
>>>>>>     		/* Self-hosted trace */
>>>>>>     		u64 trfcr_el1;
>>>>>> +		u64 trblimitr_el1;
>>>>>>     		/* Values of trap registers for the host before guest entry. */
>>>>>>     		u64 mdcr_el2;
>>>>>>     		u64 brbcr_el1;
>>>>>> diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>>>>> index 2a1c0f49792b..fd389a26bc59 100644
>>>>>> --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>>>>> +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>>>>> @@ -57,12 +57,27 @@ static void __trace_do_switch(u64 *saved_trfcr, u64 new_trfcr)
>>>>>>     	write_sysreg_el1(new_trfcr, SYS_TRFCR);
>>>>>>     }
>>>>>>     -static bool __trace_needs_drain(void)
>>>>>> +static void __trace_drain_and_disable(void)
>>>>>>     {
>>>>>> -	if (is_protected_kvm_enabled() && host_data_test_flag(HAS_TRBE))
>>>>>> -		return read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_EL1_E;
>>>>>> +	u64 *trblimitr_el1 = host_data_ptr(host_debug_state.trblimitr_el1);
>>>>>>     -	return host_data_test_flag(TRBE_ENABLED);
>>>>>> +	*trblimitr_el1 = 0;
>>>>>> +
>>>>>> +	if (is_protected_kvm_enabled()) {
>>>>>> +		if (!host_data_test_flag(HAS_TRBE))
>>>>>> +			return;
>>>>>> +	} else {
>>>>>> +		if (!host_data_test_flag(TRBE_ENABLED))
>>>>>> +			return;
>>>>>> +	}
>>>>>> +
>>>>>> +	*trblimitr_el1 = read_sysreg_s(SYS_TRBLIMITR_EL1);
>>>>>> +	if (*trblimitr_el1 & TRBLIMITR_EL1_E) {
>>>>>> +		isb();
>>>>>> +		tsb_csync();
>>>>>> +		write_sysreg_s(0, SYS_TRBLIMITR_EL1);
>>>>>> +		isb();
>>>>
>>>> The TRBE driver might do an extra drain here as a workaround. Hard to
>>>> tell if it's actually required in this case (seems like probably not)
>>>> but it might be worth doing it anyway to avoid hitting the
>>>> issue. Especially if we add guest support later where some of the
>>>> affected registers might start being used.
>>>
>>> Just to set the expectations: guest TRBE support is not happening
>>> until the architecture is fixed. It cannot reliably give a trace that
>>> includes emulated exceptions, and until then, no TRBE for you.
>>>
>>>> See:
>>>>
>>>>       if (trbe_needs_drain_after_disable(cpudata))
>>>>           trbe_drain_buffer();
>>>>
>>>>
>>>>>> +	}
>>>>>
>>>>> Doesn't this mean we should be able to get rid of most of the TRFCR
>>>>> messing about that litters the entry/exit code and leave that to VHE
>>>>
>>>> Technically you could have ETMs that are connected to sinks other
>>>> than TRBE. Unless you somehow switch off those sinks you still need to
>>>> do the TRFCR switching stuff.
>>>>
>>>>> only? And even then, I'm tempted to simply get rid of any sort of
>>>>> guest-only tracing, given that TRBE is not capable of representing
>>>>> exceptions that are synthesised by the host, making the resulting
>>>>> traces useless.
>>>>
>>>> I haven't heard of anyone tracing a guest from the host, but until we
>>>> add support for guests to be able to trace themselves it's the only
>>>> way of doing it, so it could be useful.
>>>
>>> But that's *not* working. If you trace EL1 only, even with a VHE host,
>>> the result is not usable.
>>>
>>
>> Do you mean not working because of the missing exceptions? I did a bit
>> of testing before and the trace did seem somewhat usable to me. It had
>> EL1 and EL0 atoms in there.
> 
> Sure. Now try to look at what that means for NV, where all the
> EL1->EL2 exceptions are emulated, where all the EL2->EL1 exception
> returns are emulated.
> 
> What does it give you? A bag of nonsense.
> 
> Same thing for EL2->EL0, by the way, so you can't even correctly
> profile an EL0 program that performs a syscall, or that gets
> interrupted. And while without NV, these exceptions are rare, having a
> trace that is unreliable has the potential of being worse than no
> trace at all.

If there are issues with NV perhaps we can skip it for the initial trace 
virtualisation implementation? I'm not familiar with it but isn't NV 
still an experimental feature anyway? I can't imagine actual users who 
want to do tracing in guests would accept that they can't do tracing on 
a non-NV guest because there is something that doesn't work in NV.

Also, do you have an example of the exceptions you mean in the non-NV 
case, so I can have a look? I have a hack that allows basic use of 
ETE/TRBE in VHE mode and did some recordings of syscalls, and they end 
up looking ok in the decoded trace:

  $ perf record -e cs_etm/timestamp=0/u -C 0 perf bench syscall basic
  $ perf script

Results in:

bench_syscall_common+0xb4 => aaaaaffcafe0 getppid@plt+0x0
getppid@plt+0xc           => ffffa78980c0 getppid+0x0 (libc.so.6)
getppid+0x8 (libc.so.6)   =>           0 [unknown] ([unknown])
[unknown] ([unknown])     => ffffa78980cc getppid+0xc (libc.so.6)
getppid+0xc (libc.so.6)   => aaaab0076564 bench_syscall_common+0xb8
bench_syscall_common+0xb8 => aaaab00765d8 bench_syscall_common+0x12c

This shows the branch from the bench function to getppid(), then the 
syscall into the kernel, which appears as "0 [unknown]" because I 
recorded with /u, and then the return to the bench loop.
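
A minimal sketch of how output in this form could be checked 
mechanically, assuming every branch line looks like 
"<source> => <hex address> <target>" as in the listing above. The helper 
names parse_branch and count_untraced_gaps are hypothetical and not part 
of perf; a branch to address 0 is treated as an entry into code that was 
not traced (here, the kernel, because of /u).

```python
import re
import sys

# One branch line: "<source symbol> => <hex target address> <target symbol>".
# This layout is an assumption taken from the listing above.
BRANCH_RE = re.compile(
    r"^\s*(?P<src>.+?)\s+=>\s+(?P<addr>[0-9a-fA-F]+)\s+(?P<dst>.+?)\s*$"
)


def parse_branch(line):
    """Return (source, target_address, target) or None if the line does not match."""
    m = BRANCH_RE.match(line)
    if m is None:
        return None
    return m.group("src"), int(m.group("addr"), 16), m.group("dst")


def count_untraced_gaps(lines):
    """Count branches whose target address is 0, i.e. jumps into untraced code."""
    gaps = 0
    for line in lines:
        branch = parse_branch(line)
        if branch is not None and branch[1] == 0:
            gaps += 1
    return gaps


if __name__ == "__main__":
    # Reads decoded branch lines from stdin and prints the number of gaps.
    print(count_untraced_gaps(sys.stdin))
```

Fed the six lines above on stdin, this counts the single branch into 
"0 [unknown]", which corresponds to the one getppid() syscall in the 
snippet.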

> 
> Until the architecture grows a way for KVM to inject the missing
> information into the trace, TRBE support for guest will stay out.
> 
>> All you need is the mmap records from the
>> guest which you can get by running Perf in the guest and it's possible
>> to decode it. Maybe it's not complete but I don't think all use cases
>> require complete trace. AutoFDO for example just needs lots of small
>> snippets of execution history.
> 
> I don't think it is OK to feed an FDO with traces that are known to be
> incomplete. Maybe that goes under the radar today, but my crystal ball
> is telling me things could be very different in the future, and I'm
> not going to take any bet.

The preset we added for AutoFDO 
(drivers/hwtracing/coresight/coresight-cfg-afdo.c) specifically turns 
tracing on and off to give small incomplete snippets distributed across 
the whole process while reducing the total amount of trace. I think 
that is one way to do AutoFDO and the compiler can handle it. Anyway, 
AutoFDO is just one use case for trace, and an example where incomplete 
trace is better than nothing.

In addition to that, the way the ETR and TRBE buffers are currently 
used, they're pretty bad at actually recording everything without gaps. 
Although in theory it's possible to record everything with TRBE without 
dropping anything, that's still something Leo is working on.

> 
> Thanks,
> 
> 	M.
> 



