[Question] arm64: KVM: Optimizing the cache flush during MMU enable down to a single vCPU

Jiayuan ljykernel at 163.com
Thu Apr 17 22:49:55 PDT 2025


Hi,

I'm investigating the cache flush behaviour in the arm64 KVM implementation, 
specifically commit 9d218a1fcf4c6b759d442ef702842fae92e1ea61 from Marc Zyngier, 
which addresses cache flushing when a guest enables its caches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm/kvm/mmu.c?id=9d218a1fcf4c6b759d442ef702842fae92e1ea61

In the current implementation, kvm_toggle_cache() flushes the whole of the guest's 
stage-2 address space (via stage2_flush_vm() in arch/arm/kvm/mmu.c) whenever any 
vCPU toggles its cache state:

void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
{
	bool now_enabled = vcpu_has_cache_enabled(vcpu);

	/*
	 * If switching the MMU+caches on, need to invalidate the caches.
	 * If switching it off, need to clean the caches.
	 * Clean + invalidate does the trick always.
	 */
	if (now_enabled != was_enabled)
		stage2_flush_vm(vcpu->kvm);

	/* Caches are now on, stop trapping VM ops (until a S/W op) */
	if (now_enabled)
		*vcpu_hcr(vcpu) &= ~HCR_TVM;

	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
}

I'm wondering whether it would be feasible to optimize this by performing the 
flush only on the first vCPU (vcpu0) that enables its caches. My reasoning is:

1. During guest boot, typically vcpu0 is the first to enable caches
2. Other vCPUs would follow after vcpu0 has already flushed the caches
3. This could avoid redundant cache flushes in multi-vCPU guests

Specifically, I'm considering a change like this:

void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
{
	bool now_enabled = vcpu_has_cache_enabled(vcpu);

	/*
	 * If switching the MMU+caches on, need to invalidate the caches.
	 * If switching it off, need to clean the caches.
	 * Clean + invalidate does the trick always.
	 */
	if (now_enabled != was_enabled) {
		if (vcpu->vcpu_id == 0)
			stage2_flush_vm(vcpu->kvm);
	}

	/* Caches are now on, stop trapping VM ops (until a S/W op) */
	if (now_enabled)
		*vcpu_hcr(vcpu) &= ~HCR_TVM;

	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
}
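
For completeness, another variant I sketched makes the flush once-per-VM rather 
than tied to vcpu0, using a per-VM flag. This is purely illustrative and untested: 
the stage2_flushed field below does not exist in struct kvm_arch today and would 
have to be added, and I have glossed over ordering against concurrent stage-2 
faults:

void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
{
	bool now_enabled = vcpu_has_cache_enabled(vcpu);

	/*
	 * Flush once per VM, on whichever vCPU toggles its cache state
	 * first. kvm->arch.stage2_flushed is a hypothetical unsigned long
	 * that would need to be added (zero-initialized at VM creation).
	 */
	if (now_enabled != was_enabled &&
	    !test_and_set_bit(0, &vcpu->kvm->arch.stage2_flushed))
		stage2_flush_vm(vcpu->kvm);

	/* Caches are now on, stop trapping VM ops (until a S/W op) */
	if (now_enabled)
		*vcpu_hcr(vcpu) &= ~HCR_TVM;

	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
}

One thing I'm unsure about with both versions is the cache-off direction: the 
existing comment notes that switching the caches off requires a clean, and a 
sticky per-VM flag (or a vcpu0-only check) would skip that clean on later 
toggles, or skip the flush entirely if a secondary vCPU happens to enable its 
caches first.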

Would such an optimization be correct from a cache coherency perspective? 
Are there scenarios where each vCPU needs to perform its own flush when 
enabling caches?

I'm working on optimizing KVM performance on arm64 systems and noticed this as a 
potential area for improvement.

Thank you for your insights.

Best regards,
Jiayuan Liang
