[PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support

Tue Sep 24 11:41:19 EDT 2013

On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:
> Hi Christoffer,
> 
> Finally taking some time to review this patch. Sorry for the delay...
> 
> On 09/08/13 05:07, Christoffer Dall wrote:
> > From: Christoffer Dall <cdall at cs.columbia.edu>
> > 

[ snip ]

> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > -                         gfn_t gfn, struct kvm_memory_slot *memslot,
> > +                         struct kvm_memory_slot *memslot,
> >                           unsigned long fault_status)
> >  {
> > -       pte_t new_pte;
> > -       pfn_t pfn;
> >         int ret;
> > -       bool write_fault, writable;
> > +       bool write_fault, writable, hugetlb = false, force_pte = false;
> >         unsigned long mmu_seq;
> > +       gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +       unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
> > +       struct kvm *kvm = vcpu->kvm;
> >         struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> > +       struct vm_area_struct *vma;
> > +       pfn_t pfn;
> > +       unsigned long psize;
> > 
> >         write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
> >         if (fault_status == FSC_PERM && !write_fault) {
> > @@ -525,6 +638,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                 return -EFAULT;
> >         }
> > 
> > +       /* Let's check if we will get back a huge page */
> > +       down_read(&current->mm->mmap_sem);
> > +       vma = find_vma_intersection(current->mm, hva, hva + 1);
> > +       if (is_vm_hugetlb_page(vma)) {
> > +               hugetlb = true;
> > +               hva &= PMD_MASK;
> > +               gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> > +               psize = PMD_SIZE;
> > +       } else {
> > +               psize = PAGE_SIZE;
> > +               if (vma->vm_start & ~PMD_MASK)
> > +                       force_pte = true;
> > +       }
> > +       up_read(&current->mm->mmap_sem);
> > +
> > +       pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> > +       if (is_error_pfn(pfn))
> > +               return -EFAULT;
> 
> How does this work with respect to the comment that talks about reading
> mmu_seq before calling gfp_to_pfn_prot? Either the comment is wrong, or
> we read this too early.
> 
> > +       coherent_icache_guest_page(kvm, hva, psize);
> > +
> >         /* We need minimum second+third level pages */
> >         ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
> >         if (ret)
> > @@ -542,26 +676,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >          */
> >         smp_rmb();
> > 
> > -       pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
> > -       if (is_error_pfn(pfn))
> > -               return -EFAULT;
> > -
> > -       new_pte = pfn_pte(pfn, PAGE_S2);
> > -       coherent_icache_guest_page(vcpu->kvm, gfn);
> > -
> > -       spin_lock(&vcpu->kvm->mmu_lock);
> > -       if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> > +       spin_lock(&kvm->mmu_lock);
> > +       if (mmu_notifier_retry(kvm, mmu_seq))
> >                 goto out_unlock;
> > -       if (writable) {
> > -               kvm_set_s2pte_writable(&new_pte);
> > -               kvm_set_pfn_dirty(pfn);
> > +       if (!hugetlb && !force_pte)
> > +               hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> 
> How do we ensure that there is no race between this pte being promoted
> to a pmd and another page allocation in the same pmd somewhere else in
> the system? We only hold the kvm lock here, so there must be some extra
> implicit guarantee somewhere...
> 

This isn't a promotion to a huge page, it is a mechanism to ensure that
pfn corresponds with the head page of a THP as that's where refcount
information is stored. I think this is safe.

I'm still getting my brain working with kvm, so sorry don't have any
other feedback yet sorry :-).

Cheers,
-- 
Steve