[PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size

Sean Christopherson seanjc at google.com
Tue Jul 20 13:33:46 PDT 2021


On Tue, Jul 20, 2021, Alexandru Elisei wrote:
> Hi Marc,
> 
> I just can't figure out why having the mmap lock is not needed to walk the
> userspace page tables. Any hints? Or am I not seeing where it's taken?

Disclaimer: I'm not super familiar with arm64's page tables, but the relevant KVM
functionality is common across x86 and arm64.

KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
associated with the VM, and disallows calling ioctls from a different process,
i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for which
KVM registered the mmu_notifier.  As part of registration, the mmu_notifier
does mmgrab() and doesn't do mmdrop() until it's unregistered.  That ensures the
mm_struct itself is live.
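
For reference, the registration side looks roughly like this, a simplified
sketch of kvm_init_mmu_notifier() in virt/kvm/kvm_main.c (not verbatim kernel
code):

  static int kvm_init_mmu_notifier(struct kvm *kvm)
  {
	kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;

	/*
	 * mmu_notifier_register() pins current->mm via mmgrab(); the
	 * reference isn't dropped (mmdrop()) until the notifier is
	 * unregistered, so the mm_struct itself stays live for the VM's
	 * lifetime.
	 */
	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
  }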

For the liveness of the page tables themselves, KVM implements mmu_notifier_ops.release, which is
invoked at the beginning of exit_mmap(), before the page tables are freed.  In
its implementation, KVM takes mmu_lock and zaps all its shadow page tables, a.k.a.
the stage2 tables in KVM arm64.  The flow in question, get_user_mapping_size(),
also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
guaranteed to run with live userspace tables.
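
Roughly, the release side (simplified sketch of kvm_mmu_notifier_release() in
virt/kvm/kvm_main.c, SRCU bits omitted, not verbatim):

  static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
				       struct mm_struct *mm)
  {
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	/*
	 * Runs at the start of exit_mmap(), before free_pgtables() tears
	 * down the userspace page tables.  The arch hook zaps stage2 under
	 * kvm->mmu_lock, so it serializes against a fault handler that is
	 * walking userspace tables while holding mmu_lock.
	 */
	kvm_arch_flush_shadow_all(kvm);
  }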

Lastly, KVM also implements mmu_notifier_ops.invalidate_range_{start,end}.  KVM's
invalidate_range implementations also take mmu_lock, and in addition bump a
sequence counter and set a flag indicating that an invalidation is in progress.  When
installing a stage2 entry, KVM snapshots the sequence counter before taking
mmu_lock, and then checks it again after acquiring mmu_lock.  If the counter
mismatches, or an invalidation is in-progress, then KVM bails and resumes the
guest without fixing the fault.
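
I.e. the fault side follows the usual mmu_notifier retry pattern, roughly
(sketch, not the exact arm64 code):

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();

	/* Fault in the page; may sleep, may race with an invalidation. */
	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);

	spin_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;	/* lost the race, resume the guest */

	/* Safe to install the stage2 entry... */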

E.g. if the host zaps userspace page tables and KVM "wins" the race, the subsequent
kvm_mmu_notifier_invalidate_range_start() will zap the recently installed stage2
entries.  And if the host zap "wins" the race, KVM will resume the guest, which
in normal operation will hit the exception again and go back through the entire
process of installing stage2 entries.

Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
handles the case where exit_mmap() wins the race.  The invalidate_range hooks will
still be called, so userspace page tables aren't a problem, but
kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
any additional notifications that I see.  x86 deals with this by ensuring its
top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
running.

  void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
  {
	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
	struct kvm_pgtable *pgt = NULL;

	spin_lock(&kvm->mmu_lock);
	pgt = mmu->pgt;
	if (pgt) {
		mmu->pgd_phys = 0;
		mmu->pgt = NULL;
		free_percpu(mmu->last_vcpu_ran);
	}
	spin_unlock(&kvm->mmu_lock);

	...
  }

AFAICT, nothing in user_mem_abort() would prevent consuming that NULL mmu->pgt
if exit_mmap() collided with user_mem_abort().

  static int user_mem_abort(...)
  {

	...

	spin_lock(&kvm->mmu_lock);
	pgt = vcpu->arch.hw_mmu->pgt;         <-- hw_mmu->pgt may be NULL (hw_mmu points at vcpu->kvm->arch.mmu)
	if (mmu_notifier_retry(kvm, mmu_seq)) <-- mmu_seq not guaranteed to change
		goto out_unlock;

	...

	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
	} else {
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot,
					     memcache);
	}
  }
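
If that is indeed a hole, the simplest band-aid I can think of (purely a
hypothetical sketch, not a tested patch) would be to bail if the pgt has
already been freed, e.g.:

	spin_lock(&kvm->mmu_lock);
	pgt = vcpu->arch.hw_mmu->pgt;
	if (!pgt || mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;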


