[PATCH v2 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2

Wed Oct 8 02:47:44 PDT 2014

On Wed, Oct 08, 2014 at 10:34:31AM +0100, Marc Zyngier wrote:
> On 07/10/14 20:39, Christoffer Dall wrote:
> > On Tue, Oct 07, 2014 at 02:28:43PM +0100, Marc Zyngier wrote:
> >> On 07/10/14 11:48, Catalin Marinas wrote:
> >>> On Mon, Oct 06, 2014 at 09:30:25PM +0100, Christoffer Dall wrote:
> >>>> +/**
> >>>> + * kvm_prealloc_hwpgd - allocate inital table for VTTBR
> >>>> + * @kvm:       The KVM struct pointer for the VM.
> >>>> + * @pgd:       The kernel pseudo pgd
> >>>> + *
> >>>> + * When the kernel uses more levels of page tables than the guest, we allocate
> >>>> + * a fake PGD and pre-populate it to point to the next-level page table, which
> >>>> + * will be the real initial page table pointed to by the VTTBR.
> >>>> + *
> >>>> + * When KVM_PREALLOC_LEVEL==2, we allocate a single page for the PMD and
> >>>> + * the kernel will use folded pud.  When KVM_PREALLOC_LEVEL==1, we
> >>>> + * allocate 2 consecutive PUD pages.
> >>>> + */
> >>>> +#if defined(CONFIG_ARM64_64K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 3
> >>>> +#define KVM_PREALLOC_LEVEL     2
> >>>> +#define PTRS_PER_S2_PGD                1
> >>>> +#define S2_PGD_ORDER           get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
> >>>
> >>> I agree that my magic equation wasn't readable ;) (I had troubles
> >>> re-understanding it as well), but you also have some constants here that
> >>> are not immediately obvious where you got to them from. IIUC,
> >>> KVM_PREALLOC_LEVEL == 2 here means that the hardware only understands
> >>> stage 2 pmd and pte. I guess you could look into the ARM ARM tables but
> >>> it's still not clear.
> >>>
> >>> Let's look at PTRS_PER_S2_PGD as I think it's simpler. My proposal was:
> >>>
> >>> #if PGDIR_SHIFT > KVM_PHYS_SHIFT
> >>> #define PTRS_PER_S2_PGD                     (1)
> >>> #else
> >>> #define PTRS_PER_S2_PGD                     (1 << (KVM_PHYS_SHIFT - PGDIR_SHIFT))
> >>> #endif
> >>>
> >>> In this case PGDIR_SHIFT is 42, so we get PTRS_PER_S2_PGD == 1. The 4K
> >>> and 4 levels case below is also correct.
> >>>
> >>> The KVM start level calculation, we could assume that KVM needs either
> >>> host levels or host levels - 1 (unless we go for some weirdly small
> >>> KVM_PHYS_SHIFT). So we could define them KVM_PREALLOC_LEVEL as:
> >>>
> >>> #if PTRS_PER_S2_PGD <= 16
> >>> #define KVM_PREALLOC_LEVEL  (4 - CONFIG_ARM64_PGTABLE_LEVELS + 1)
> >>> #else
> >>> #define KVM_PREALLOC_LEVEL  (0)
> >>> #endif
> >>>
> >>> Basically if you can concatenate 16 or less pages at the level below the
> >>> top, the architecture does not allow a small top level. In this case,
> >>> (4 - CONFIG_ARM64_PGTABLE_LEVELS) represents the first level for the
> >>> host and we add 1 to go to the next level for KVM stage 2 when
> >>> PTRS_PER_S2_PGD is 16 or less. We use 0 when we don't need to
> >>> preallocate.
> >>
> >> I think this makes the whole thing clearer (at least for me), as it
> >> makes the relationship between KVM_PREALLOC_LEVEL and
> >> CONFIG_ARM64_PGTABLE_LEVELS explicit (it wasn't completely obvious to me
> >> initially).
> > 
> > Agreed.
> > 
> >>
> >>>> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
> >>>> +{
> >>>> +       pud_t *pud;
> >>>> +       pmd_t *pmd;
> >>>> +
> >>>> +       pud = pud_offset(pgd, 0);
> >>>> +       pmd = (pmd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 0);
> >>>> +
> >>>> +       if (!pmd)
> >>>> +               return -ENOMEM;
> >>>> +       pud_populate(NULL, pud, pmd);
> >>>> +
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +static inline void kvm_free_hwpgd(struct kvm *kvm)
> >>>> +{
> >>>> +       pgd_t *pgd = kvm->arch.pgd;
> >>>> +       pud_t *pud = pud_offset(pgd, 0);
> >>>> +       pmd_t *pmd = pmd_offset(pud, 0);
> >>>> +       free_pages((unsigned long)pmd, 0);
> >>>> +}
> >>>> +
> >>>> +static inline phys_addr_t kvm_get_hwpgd(struct kvm *kvm)
> >>>> +{
> >>>> +       pgd_t *pgd = kvm->arch.pgd;
> >>>> +       pud_t *pud = pud_offset(pgd, 0);
> >>>> +       pmd_t *pmd = pmd_offset(pud, 0);
> >>>> +       return virt_to_phys(pmd);
> >>>> +
> >>>> +}
> >>>> +#elif defined(CONFIG_ARM64_4K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 4
> >>>> +#define KVM_PREALLOC_LEVEL     1
> >>>> +#define PTRS_PER_S2_PGD                2
> >>>> +#define S2_PGD_ORDER           get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
> >>>
> >>> Here PGDIR_SHIFT is 39, so we get PTRS_PER_S2_PGD == (1 << (40 - 39))
> >>> which is 2 and KVM_PREALLOC_LEVEL == 1.
> >>>
> >>>> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
> >>>> +{
> >>>> +       pud_t *pud;
> >>>> +
> >>>> +       pud = (pud_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1);
> >>>> +       if (!pud)
> >>>> +               return -ENOMEM;
> >>>> +       pgd_populate(NULL, pgd, pud);
> >>>> +       pgd_populate(NULL, pgd + 1, pud + PTRS_PER_PUD);
> >>>> +
> >>>> +       return 0;
> >>>> +}
> >>>
> >>> You still need to define these functions but you can make their
> >>> implementation dependent solely on the KVM_PREALLOC_LEVEL rather than
> >>> 64K/4K and levels combinations. If it is KVM_PREALLOC_LEVEL is 1, you
> >>> allocate pud and populate the pgds (in a loop based on the
> >>> PTRS_PER_S2_PGD). If it is 2, you allocate the pmd and populate the pud
> >>> (still in a loop though it would probably be 1 iteration). We know based
> >>> on the assumption above that you can't get KVM_PREALLOC_LEVEL == 2 and
> >>> CONFIG_ARM64_PGTABLE_LEVELS == 4.
> >>>
> >>
> >> Also agreed. Most of what you wrote here could also be gathered as
> >> comments in the patch.
> >>
> > Yes, I reworded some of the text slightly as comments for the next
> > version of the patch.
> > 
> > However, I'm not sure I have a clear idea of how you'd like these
> > functions to look like.
> > 
> > I came up with the following based on your feedback, but I personally
> > don't find it a lot easier to read than what I had already.  Suggestions
> > are welcome:
> > 
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index a030d16..7941a51 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -41,6 +41,18 @@
> >   */
> >  #define TRAMPOLINE_VA          (HYP_PAGE_OFFSET_MASK & PAGE_MASK)
> > 
> > +/*
> > + * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> > + * levels in addition to the PGD and potentially the PUD which are
> > + * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
> > + * tables use one level of tables less than the kernel.
> > + */
> > +#ifdef CONFIG_ARM64_64K_PAGES
> > +#define KVM_MMU_CACHE_MIN_PAGES        1
> > +#else
> > +#define KVM_MMU_CACHE_MIN_PAGES        2
> > +#endif
> > +
> >  #ifdef __ASSEMBLY__
> > 
> >  /*
> > @@ -53,6 +65,7 @@
> > 
> >  #else
> > 
> > +#include <asm/pgalloc.h>
> >  #include <asm/cachetype.h>
> >  #include <asm/cacheflush.h>
> > 
> > @@ -65,10 +78,6 @@
> >  #define KVM_PHYS_SIZE  (1UL << KVM_PHYS_SHIFT)
> >  #define KVM_PHYS_MASK  (KVM_PHYS_SIZE - 1UL)
> > 
> > -/* Make sure we get the right size, and thus the right alignment */
> > -#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - PGDIR_SHIFT))
> > -#define S2_PGD_ORDER   get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
> > -
> >  int create_hyp_mappings(void *from, void *to);
> >  int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
> >  void free_boot_hyp_pgd(void);
> > @@ -93,6 +102,7 @@ void kvm_clear_hyp_idmap(void);
> >  #define        kvm_set_pmd(pmdp, pmd)          set_pmd(pmdp, pmd)
> > 
> >  static inline void kvm_clean_pgd(pgd_t *pgd) {}
> > +static inline void kvm_clean_pmd(pmd_t *pmd) {}
> >  static inline void kvm_clean_pmd_entry(pmd_t *pmd) {}
> >  static inline void kvm_clean_pte(pte_t *pte) {}
> >  static inline void kvm_clean_pte_entry(pte_t *pte) {}
> > @@ -118,13 +128,115 @@ static inline bool kvm_page_empty(void *ptr)
> >  }
> > 
> >  #define kvm_pte_table_empty(ptep) kvm_page_empty(ptep)
> > -#ifndef CONFIG_ARM64_64K_PAGES
> > -#define kvm_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> > -#else
> > +
> > +#ifdef __PAGETABLE_PMD_FOLDED
> >  #define kvm_pmd_table_empty(pmdp) (0)
> > +#else
> > +#define kvm_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> >  #endif
> > +
> > +#ifdef __PAGETABLE_PUD_FOLDED
> >  #define kvm_pud_table_empty(pudp) (0)
> > +#else
> > +#define kvm_pud_table_empty(pudp) kvm_page_empty(pudp)
> > +#endif
> > +
> > +/*
> > + * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
> > + * the entire IPA input range with a single pgd entry, and we would only need
> > + * one pgd entry.
> > + */
> > +#if PGDIR_SHIFT > KVM_PHYS_SHIFT
> > +#define PTRS_PER_S2_PGD                (1)
> > +#else
> > +#define PTRS_PER_S2_PGD                (1 << (KVM_PHYS_SHIFT - PGDIR_SHIFT))
> > +#endif
> > +#define S2_PGD_ORDER           get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
> > 
> > +/*
> > + * If we are concatenating first level stage-2 page tables, we would have less
> > + * than or equal to 16 pointers in the fake PGD, because that's what the
> > + * architecture allows.  In this case, (4 - CONFIG_ARM64_PGTABLE_LEVELS)
> > + * represents the first level for the host, and we add 1 to go to the next
> > + * level (which uses contatenation) for the stage-2 tables.
> > + */
> > +#if PTRS_PER_S2_PGD <= 16
> > +#define KVM_PREALLOC_LEVEL     (4 - CONFIG_ARM64_PGTABLE_LEVELS + 1)
> > +#else
> > +#define KVM_PREALLOC_LEVEL     (0)
> > +#endif
> > +
> > +/**
> > + * kvm_prealloc_hwpgd - allocate inital table for VTTBR
> > + * @kvm:       The KVM struct pointer for the VM.
> > + * @pgd:       The kernel pseudo pgd
> > + *
> > + * When the kernel uses more levels of page tables than the guest, we allocate
> > + * a fake PGD and pre-populate it to point to the next-level page table, which
> > + * will be the real initial page table pointed to by the VTTBR.
> > + *
> > + * When KVM_PREALLOC_LEVEL==2, we allocate a single page for the PMD and
> > + * the kernel will use folded pud.  When KVM_PREALLOC_LEVEL==1, we
> > + * allocate 2 consecutive PUD pages.
> > + */
> > +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
> > +{
> > +       pud_t *pud;
> > +       pmd_t *pmd;
> > +       unsigned int order, i;
> > +       unsigned long hwpgd;
> > +
> > +       if (KVM_PREALLOC_LEVEL == 0)
> > +               return 0;
> > +
> > +       order = get_order(PTRS_PER_S2_PGD);
> 
> S2_PGD_ORDER instead? Otherwise, that doesn't seem quite right...
> 

no, S2_PGD_ORDER is always the order of the PGD (in linux
macro-world-pgd-terms) that we allocate currently.  Of course, we could
rework that like this:

#if PTRS_PER_S2_PGD <= 16
#define KVM_PREALLOC_LEVEL     (4 - CONFIG_ARM64_PGTABLE_LEVELS + 1)
#define S2_PGD_ORDER           get_order(PTRS_PER_S2_PGD)
#else
#define KVM_PREALLOC_LEVEL     (0)
#define S2_PGD_ORDER           get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
#endif

> > +       hwpgd = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> > +       if (!hwpgd)
> > +               return -ENOMEM;
> > +
> > +       if (KVM_PREALLOC_LEVEL == 1) {
> > +               pud = (pud_t *)hwpgd;
> > +               for (i = 0; i < PTRS_PER_S2_PGD; i++)
> > +                       pgd_populate(NULL, pgd + i, pud + i * PTRS_PER_PUD);
> > +       } else if (KVM_PREALLOC_LEVEL == 2) {
> > +               pud = pud_offset(pgd, 0);
> > +               pmd = (pmd_t *)hwpgd;
> > +               for (i = 0; i < PTRS_PER_S2_PGD; i++)
> > +                       pud_populate(NULL, pud + i, pmd + i * PTRS_PER_PMD);
> > +       }
> > +
> > +       return 0;
> 
> Shouldn't we return an error here instead? Or BUG()?
> 

yes, should never happen.

> > +}
> > +
> > +static inline void *kvm_get_hwpgd(struct kvm *kvm)
> > +{
> > +       pgd_t *pgd = kvm->arch.pgd;
> > +       pud_t *pud;
> > +       pmd_t *pmd;
> > +
> > +       switch (KVM_PREALLOC_LEVEL) {
> > +       case 0:
> > +               return pgd;
> > +       case 1:
> > +               pud = pud_offset(pgd, 0);
> > +               return pud;
> > +       case 2:
> > +               pud = pud_offset(pgd, 0);
> > +               pmd = pmd_offset(pud, 0);
> > +               return pmd;
> > +       default:
> > +               BUG();
> > +               return NULL;
> > +       }
> > +}
> > +
> > +static inline void kvm_free_hwpgd(struct kvm *kvm)
> > +{
> > +       if (KVM_PREALLOC_LEVEL > 0) {
> > +               unsigned long hwpgd = (unsigned long)kvm_get_hwpgd(kvm);
> > +               free_pages(hwpgd, get_order(S2_PGD_ORDER));
> 
> Isn't the get_order() a bit wrong here? I'd expect S2_PGD_ORDER to be
> what we need already...
> 

yikes!  gonzo coding.

what it should be is 
	free_pages(hwpgd, get_order(PTRS_PER_S2_PGD));

but it ties into the discussion above.

> > +       }
> > +}
> 
> I personally like this version more (Catalin may have a different
> opinion ;-).
> 
I'm fine with this, but I wasn't sure if you guys had something more
clever/beautiful in mind....?

-Christoffer