[PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU

David Matlack dmatlack at google.com
Mon Apr 11 16:41:12 PDT 2022


On Mon, Apr 11, 2022 at 1:12 PM Sean Christopherson <seanjc at google.com> wrote:
>
> On Mon, Apr 11, 2022, David Matlack wrote:
> > On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc at google.com> wrote:
> > > Circling back to eager page splitting, this series could be reworked to take the
> > > first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> > > order to provide the necessary path for reworking nested MMU page faults.  Then it
> > > can remove unsync and shrinker support for nested MMUs.  With those gone,
> > > dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> > > than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> > > some of the complexity/churn.
> >
> > These sound like useful improvements but I am not really seeing the
> > value of sequencing them before this series:
> >
> >  - IMO the "churn" in patches 1-14 is a net improvement to the
> > existing code. They improve readability by decomposing the shadow page
> > creation path into smaller functions with better names, reduce
> > redundant calculations, and reduce the dependence on struct
> > kvm_vcpu where it is not needed. Even if eager page splitting is
> > completely dropped, I think they would be useful to merge.
>
> I definitely like some of patches 1-14, probably most after a few read throughs.
> But there are key parts that I do not like that are motivated almost entirely by
> the desire to support page splitting.  Specifically, I don't like splitting the
> logic of finding a page, and I don't like having a separate alloc vs. initializer
> (though I'm guessing this will be needed somewhere to split huge pages for nested
> MMUs).
>
> E.g. I'd prefer the "get" flow look like the below (completely untested, for
> discussion purposes only).  There's still churn, but the core loop is almost
> entirely unchanged.
>
> And it's not just this series: I don't want future improvements to nested TDP
> to have to deal with the legacy baggage.

One thing that would be helpful is if you could explain a bit more
specifically what you'd like to see. Part of the reason I prefer to
sequence your proposal after eager page splitting is that I do not
fully understand what you're proposing, or how complex it would be.
E.g. forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
for nested MMUs does not sound like less churn.

From my perspective, this series is a net improvement to the
readability and maintainability of the existing code, while also adding
a performance improvement (eager page splitting). All of the changes
you are proposing can still be implemented on top if and when they
become a priority (including merging {,__}kvm_find_shadow_page()). And
if we limit eager page splitting to nested MMUs, we don't have to worry
about maintaining eager page splitting for the TDP shadow MMU or legacy
shadow paging over time.
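
To give a concrete sense of why the series reduces the dependence on
struct kvm_vcpu in the first place: eager page splitting runs from a
VM-scoped ioctl (e.g. when dirty logging is enabled), so there is no
vCPU whose caches it can allocate from. Very roughly (untested,
hypothetical names, not the actual code from this series), the split
path needs to be able to do something like:

struct shadow_page_caches {
        struct kvm_mmu_memory_cache *page_header_cache;
        struct kvm_mmu_memory_cache *shadow_page_cache;
        struct kvm_mmu_memory_cache *gfn_array_cache;
};

/*
 * Hypothetical sketch: allocate and initialize a shadow page from
 * caller-provided caches (filled outside of any vCPU context), with no
 * struct kvm_vcpu anywhere in sight.
 */
static struct kvm_mmu_page *alloc_sp_for_split(struct kvm *kvm,
                                               struct shadow_page_caches *caches,
                                               gfn_t gfn,
                                               union kvm_mmu_page_role role)
{
        struct kvm_mmu_page *sp;

        sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
        sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
        if (!role.direct)
                sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
        set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

        sp->gfn = gfn;
        sp->role = role;
        return sp;
}

The vCPU fault path can then share the same initialization code by
passing its per-vCPU caches through the same structure, which is
roughly the kind of refactoring patches 1-14 are aiming for.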


>
> Waaaay off topic, why do we still bother with stat.max_mmu_page_hash_collisions?
> I assume it was originally added to tune the hashing logic?  At this point is it
> anything but wasted cycles?
>
> static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
>                                                      gfn_t gfn,
>                                                      unsigned int gfn_hash,
>                                                      union kvm_mmu_page_role role)
> {
>         struct kvm *kvm = vcpu->kvm;
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>         struct kvm_mmu_page *sp;
>         LIST_HEAD(invalid_list);
>         int collisions = 0;
>
>         for_each_valid_sp(kvm, sp, sp_list) {
>                 if (sp->gfn != gfn) {
>                         collisions++;
>                         continue;
>                 }
>
>                 if (sp->role.word != role.word) {
>                         /*
>                          * If the guest is creating an upper-level page, zap
>                          * unsync pages for the same gfn.  While it's possible
>                          * the guest is using recursive page tables, in all
>                          * likelihood the guest has stopped using the unsync
>                          * page and is installing a completely unrelated page.
>                          * Unsync pages must not be left as is, because the new
>                          * upper-level page will be write-protected.
>                          */
>                         if (role.level > PG_LEVEL_4K && sp->unsync)
>                                 kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>
>                         continue;
>                 }
>
>                 /* unsync and write-flooding only apply to indirect SPs. */
>                 if (sp->role.direct)
>                         goto out;
>
>                 if (sp->unsync) {
>                         /*
>                          * The page is good, but is stale.  kvm_sync_page does
>                          * get the latest guest state, but (unlike mmu_unsync_children)
>                          * it doesn't write-protect the page or mark it synchronized!
>                          * This way the validity of the mapping is ensured, but the
>                          * overhead of write protection is not incurred until the
>                          * guest invalidates the TLB mapping.  This allows multiple
>                          * SPs for a single gfn to be unsync.
>                          *
>                          * If the sync fails, the page is zapped.  If so, break
>                          * in order to rebuild it.
>                          */
>                         if (!kvm_sync_page(vcpu, sp, &invalid_list))
>                                 break;
>
>                         WARN_ON(!list_empty(&invalid_list));
>                         kvm_flush_remote_tlbs(vcpu->kvm);
>                 }
>
>                 __clear_sp_write_flooding_count(sp);
>                 goto out;
>         }
>
>         sp = NULL;
>
> out:
>         if (collisions > kvm->stat.max_mmu_page_hash_collisions)
>                 kvm->stat.max_mmu_page_hash_collisions = collisions;
>
>         kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
>         return sp;
> }
>
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
>                                                       gfn_t gfn,
>                                                       unsigned int gfn_hash,
>                                                       union kvm_mmu_page_role role)
> {
>         struct kvm *kvm = vcpu->kvm;
>         struct kvm_mmu_page *sp = __kvm_mmu_alloc_shadow_page(vcpu, role.direct);
>         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>
>         ++kvm->stat.mmu_cache_miss;
>
>         sp->gfn = gfn;
>         sp->role = role;
>         sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
>
>         /*
>          * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
>          * depends on valid pages being added to the head of the list.  See
>          * comments in kvm_zap_obsolete_pages().
>          */
>         list_add(&sp->link, &kvm->arch.active_mmu_pages);
>         kvm_mod_used_mmu_pages(kvm, 1);
>
>         hlist_add_head(&sp->hash_link, sp_list);
>
>         if (!role.direct)
>                 account_shadowed(kvm, slot, sp);
>
>         return sp;
> }
>
>
> static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
>                                                     gfn_t gfn,
>                                                     union kvm_mmu_page_role role)
> {
>         unsigned int gfn_hash = kvm_page_table_hashfn(gfn);
>         struct kvm_mmu_page *sp;
>         bool created = false;
>
>         sp = kvm_mmu_find_shadow_page(vcpu, gfn, gfn_hash, role);
>         if (!sp) {
>                 created = true;
>                 sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, gfn_hash, role);
>         }
>
>         trace_kvm_mmu_get_page(sp, created);
>         return sp;
> }
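
FWIW, here's an equally rough sketch (untested, hypothetical caller,
not taken from either version of the code) of how a role-based get
helper like the above ends up being used, e.g. when allocating a
direct root:

static hpa_t mmu_alloc_direct_root(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
{
        union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base;
        struct kvm_mmu_page *sp;

        /*
         * Build the role up front instead of passing gva/level/direct/
         * access as separate parameters the way kvm_mmu_get_page()
         * does today.
         */
        role.level = level;
        role.direct = 1;
        role.access = ACC_ALL;

        sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
        ++sp->root_count;

        return __pa(sp->spt);
}

i.e. callers construct the role once, and whether the get helper is
split into find/alloc internally becomes an implementation detail.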


