[PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
Sean Christopherson
seanjc at google.com
Wed Jul 21 08:56:09 PDT 2021
On Wed, Jul 21, 2021, Will Deacon wrote:
> > For page table liveness, KVM implements mmu_notifier_ops.release, which is
> > invoked at the beginning of exit_mmap(), before the page tables are freed. In
> > its implementation, KVM takes mmu_lock and zaps all its shadow page tables, a.k.a.
> > the stage2 tables in KVM arm64. The flow in question, get_user_mapping_size(),
> > also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> > guaranteed to run with live userspace tables.
>
> Unless I missed a case, exit_mmap() only runs when mm_struct::mm_users drops
> to zero, right?
Yep.
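
For reference, the teardown path in question, heavily trimmed (a sketch of
kernel/fork.c, not the verbatim code):

	/* Sketch: exit_mmap() is only reachable once the final mm_users
	 * reference goes away. */
	static void __mmput(struct mm_struct *mm)
	{
		/* ... */
		exit_mmap(mm);	/* frees the VMAs and userspace page tables */
		/* ... */
		mmdrop(mm);
	}

	void mmput(struct mm_struct *mm)
	{
		might_sleep();

		/* Only the *last* mm_users reference reaches __mmput(). */
		if (atomic_dec_and_test(&mm->mm_users))
			__mmput(mm);
	}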
> The vCPU tasks should hold references to that afaict, so I don't think it
> should be possible for exit_mmap() to run while there are vCPUs running with
> the corresponding page-table.
Ah, right, I was thinking of non-KVM code that operates on the page tables
without holding a reference to mm_users.
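
To tie that back to the patch: get_user_mapping_size() runs with mmu_lock held,
and conceptually it's just a software walk of the userspace tables to find the
leaf level backing the faulting address, something along these lines
(illustrative sketch using the generic accessors; not necessarily how the patch
actually implements it):

	#include <linux/mm.h>
	#include <linux/pgtable.h>

	/*
	 * Sketch only: return the size of the userspace mapping covering
	 * @addr.  Caller must guarantee the tables are live, e.g. by holding
	 * mmu_lock so that mmu_notifier_ops.release (and thus exit_mmap())
	 * can't complete underneath us.
	 */
	static unsigned long sketch_user_mapping_size(struct mm_struct *mm,
						      unsigned long addr)
	{
		pgd_t *pgd = pgd_offset(mm, addr);
		p4d_t *p4d;
		pud_t *pud;
		pmd_t *pmd;

		if (pgd_none(*pgd))
			return PAGE_SIZE;

		p4d = p4d_offset(pgd, addr);
		if (p4d_none(*p4d))
			return PAGE_SIZE;

		pud = pud_offset(p4d, addr);
		if (pud_none(*pud))
			return PAGE_SIZE;
		if (pud_leaf(*pud))
			return PUD_SIZE;

		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd))
			return PAGE_SIZE;
		if (pmd_leaf(*pmd))
			return PMD_SIZE;

		/* No huge leaf found; fall back to the base page size. */
		return PAGE_SIZE;
	}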
> > Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
> > handles the case where exit_mmap() wins the race. The invalidate_range hooks will
> > still be called, so userspace page tables aren't a problem, but
> > kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
> > any additional notifications that I see. x86 deals with this by ensuring its
> > top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
> > running.
>
> But the fact that x86 handles this race has me worried. What am I missing?
I don't think you're missing anything. I forgot that KVM_RUN would require an
elevated mm_users. x86 does handle the impossible race, but that's coincidental.
The extra protections in x86 are to deal with other cases where a vCPU's top-level
SPTE can be invalidated while the vCPU is running.
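
If arm64 ever did need to close that race, the shape of the fix would be
something like the below (sketch only, not existing code): re-check mmu->pgt
under mmu_lock before walking or installing anything, since
kvm_free_stage2_pgd() nullifies it.

	#include <linux/kvm_host.h>	/* struct kvm and, on arm64, struct kvm_s2_mmu */

	/*
	 * Hypothetical guard, for illustration only: bail out of the stage-2
	 * fault path if the page tables were torn down in the meantime.
	 */
	static int sketch_stage2_fault(struct kvm *kvm, struct kvm_s2_mmu *mmu)
	{
		int ret = 0;

		spin_lock(&kvm->mmu_lock);
		if (!mmu->pgt) {
			/* kvm_free_stage2_pgd() won the race; have the vCPU retry/exit. */
			ret = -EAGAIN;
			goto out_unlock;
		}

		/* ... safe to walk and map through mmu->pgt here ... */

	out_unlock:
		spin_unlock(&kvm->mmu_lock);
		return ret;
	}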