[PATCH v2 6/7] ARM: KVM: switch to a dual-step HYP init code

Tue Apr 9 06:42:56 EDT 2013

On 09/04/13 10:18, Will Deacon wrote:
> On Mon, Apr 08, 2013 at 04:36:42PM +0100, Marc Zyngier wrote:
>> Our HYP init code suffers from two major design issues:
>> - it cannot support CPU hotplug, as we tear down the idmap very early
>> - it cannot perform a TLB invalidation when switching from init to
>>   runtime mappings, as pages are manipulated from PL1 exclusively
>>
>> The hotplug problem mandates that we keep two sets of page tables
>> (boot and runtime). The TLB problem mandates that we're able to
>> transition from one PGD to another while in HYP, invalidating the TLBs
>> in the process.
>>
>> To be able to do this, we need to share a page between the two page
>> tables. A page that will have the same VA in both configurations. All we
>> need is a VA that has the following properties:
>> - This VA can't be used to represent a kernel mapping.
>> - This VA will not conflict with the physical address of the kernel text
>>
>> The vectors page seems to satisfy this requirement:
>> - The kernel never maps anything else there
>> - The kernel text being copied at the beginning of the physical memory,
>>   it is unlikely to use the last 64kB (I doubt we'll ever support KVM
>>   on a system with something like 4MB of RAM, but patches are very
>>   welcome).
>>
>> Let's call this VA the trampoline VA.
>>
>> Now, we map our init page at 3 locations:
>> - idmap in the boot pgd
>> - trampoline VA in the boot pgd
>> - trampoline VA in the runtime pgd
>>
>> The init scenario is now the following:
>> - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
>>   runtime stack, runtime vectors
>> - Enable the MMU with the boot pgd
>> - Jump to a target into the trampoline page (remember, this is the same
>>   physical page!)
>> - Now switch to the runtime pgd (same VA, and still the same physical
>>   page!)
>> - Invalidate TLBs
>> - Set stack and vectors
>> - Profit! (or eret, if you only care about the code).
>>
>> Note that we keep the boot mapping permanently (it is not strictly an
>> idmap anymore) to allow for CPU hotplug in later patches.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier at arm.com>
>> ---
>>  arch/arm/include/asm/kvm_host.h |  18 ++++--
>>  arch/arm/include/asm/kvm_mmu.h  |  24 +++++++-
>>  arch/arm/kvm/arm.c              |  11 ++--
>>  arch/arm/kvm/init.S             |  44 +++++++++++++-
>>  arch/arm/kvm/mmu.c              | 129 ++++++++++++++++++++++++++++------------
>>  5 files changed, 173 insertions(+), 53 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index a856cc2..9fe22ee 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -190,22 +190,32 @@ int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
>>  int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>                 int exception_index);
>>
>> -static inline void __cpu_init_hyp_mode(unsigned long long pgd_ptr,
>> +static inline void __cpu_init_hyp_mode(unsigned long long boot_pgd_ptr,
>> +                                      unsigned long long pgd_ptr,
>>                                        unsigned long hyp_stack_ptr,
>>                                        unsigned long vector_ptr)
>>  {
>>         unsigned long pgd_low, pgd_high;
>>
>> -       pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
>> -       pgd_high = (pgd_ptr >> 32ULL);
>> +       pgd_low = (boot_pgd_ptr & ((1ULL << 32) - 1));
>> +       pgd_high = (boot_pgd_ptr >> 32ULL);
>>
>>         /*
>>          * Call initialization code, and switch to the full blown
>>          * HYP code. The init code doesn't need to preserve these registers as
>> -        * r1-r3 and r12 are already callee save according to the AAPCS.
>> +        * r1-r3 and r12 are already callee saved according to the AAPCS.
>>          * Note that we slightly misuse the prototype by casing the pgd_low to
>>          * a void *.
>> +        *
>> +        * We don't have enough registers to perform the full init in one go.
>> +        * Install the boot PGD first, and then install the runtime PGD,
>> +        * stack pointer and vectors.
>>          */
>> +       kvm_call_hyp((void *)pgd_low, pgd_high, 0, 0);
>> +
>> +       pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
>> +       pgd_high = (pgd_ptr >> 32ULL);
> 
> It might be worth macro-ising the low/high extraction operations now that
> you're using them twice. Hell, you could even make them big-endian clean!

Now you're talking! ;-)

I actually wonder if we could rearrange the code to let the compiler do
the low/high split... Quickly looking through the PCS, it looks like
this would actually work.

>>         kvm_call_hyp((void *)pgd_low, pgd_high, hyp_stack_ptr, vector_ptr);
>>  }
>>
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 92eb20d..24b767a 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -19,17 +19,29 @@
>>  #ifndef __ARM_KVM_MMU_H__
>>  #define __ARM_KVM_MMU_H__
>>
>> -#include <asm/cacheflush.h>
>> -#include <asm/pgalloc.h>
>> +#include <asm/memory.h>
>> +#include <asm/page.h>
>>
>>  /*
>>   * We directly use the kernel VA for the HYP, as we can directly share
>>   * the mapping (HTTBR "covers" TTBR1).
>>   */
>> -#define HYP_PAGE_OFFSET_MASK   (~0UL)
>> +#define HYP_PAGE_OFFSET_MASK   UL(~0)
>>  #define HYP_PAGE_OFFSET                PAGE_OFFSET
>>  #define KERN_TO_HYP(kva)       (kva)
>>
>> +/*
>> + * Our virtual mapping for the boot-time MMU-enable code. Must be
>> + * shared across all the page-tables. Conveniently, we use the vectors
>> + * page, where no kernel data will ever be shared with HYP.
>> + */
>> +#define TRAMPOLINE_VA          UL(CONFIG_VECTORS_BASE)
>> +
>> +#ifndef __ASSEMBLY__
>> +
>> +#include <asm/cacheflush.h>
>> +#include <asm/pgalloc.h>
>> +
>>  int create_hyp_mappings(void *from, void *to);
>>  int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
>>  void free_hyp_pgds(void);
>> @@ -44,6 +56,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>>  void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
>>
>>  phys_addr_t kvm_mmu_get_httbr(void);
>> +phys_addr_t kvm_mmu_get_boot_httbr(void);
>> +phys_addr_t kvm_get_idmap_vector(void);
>>  int kvm_mmu_init(void);
>>  void kvm_clear_hyp_idmap(void);
>>
>> @@ -113,4 +127,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>         }
>>  }
>>
>> +#define kvm_flush_dcache_to_poc(a,l)   __cpuc_flush_dcache_area((a), (l))
>> +
>> +#endif /* !__ASSEMBLY__ */
>> +
>>  #endif /* __ARM_KVM_MMU_H__ */
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 8d44ef4..9010a12 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -796,20 +796,22 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>
>>  static void cpu_init_hyp_mode(void *vector)
>>  {
>> +       unsigned long long boot_pgd_ptr;
>>         unsigned long long pgd_ptr;
>>         unsigned long hyp_stack_ptr;
>>         unsigned long stack_page;
>>         unsigned long vector_ptr;
>>
>>         /* Switch from the HYP stub to our own HYP init vector */
>> -       __hyp_set_vectors((unsigned long)vector);
>> +       __hyp_set_vectors(kvm_get_idmap_vector());
>>
>> +       boot_pgd_ptr = (unsigned long long)kvm_mmu_get_boot_httbr();
>>         pgd_ptr = (unsigned long long)kvm_mmu_get_httbr();
> 
> Strictly speaking, could you avoid using two sets of tables for this? That
> is, have code in the trampoline page which remaps the rest of the address
> space using the current tables? Not saying it's feasible, just interested.

If you do that, you loose the ability to bring in a hotplugged CPU, as
your idmap and your runtime page tables may have conflicting
translations. Or am I missing something obvious?

>>         stack_page = __get_cpu_var(kvm_arm_hyp_stack_page);
>>         hyp_stack_ptr = stack_page + PAGE_SIZE;
>>         vector_ptr = (unsigned long)__kvm_hyp_vector;
>>
>> -       __cpu_init_hyp_mode(pgd_ptr, hyp_stack_ptr, vector_ptr);
>> +       __cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
>>  }
>>
>>  /**
>> @@ -863,11 +865,6 @@ static int init_hyp_mode(void)
>>         }
>>
>>         /*
>> -        * Unmap the identity mapping
>> -        */
>> -       kvm_clear_hyp_idmap();
>> -
>> -       /*
>>          * Map the Hyp-code called directly from the host
>>          */
>>         err = create_hyp_mappings(__kvm_hyp_code_start, __kvm_hyp_code_end);
>> diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
>> index 9f37a79..0ca054f 100644
>> --- a/arch/arm/kvm/init.S
>> +++ b/arch/arm/kvm/init.S
>> @@ -21,6 +21,7 @@
>>  #include <asm/asm-offsets.h>
>>  #include <asm/kvm_asm.h>
>>  #include <asm/kvm_arm.h>
>> +#include <asm/kvm_mmu.h>
>>
>>  /********************************************************************
>>   * Hypervisor initialization
>> @@ -28,6 +29,25 @@
>>   *       r0,r1 = Hypervisor pgd pointer
>>   *       r2 = top of Hyp stack (kernel VA)
>>   *       r3 = pointer to hyp vectors
>> + *
>> + * The init scenario is:
>> + * - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
>> + *   runtime stack, runtime vectors
>> + * - Enable the MMU with the boot pgd
>> + * - Jump to a target into the trampoline page (remember, this is the same
>> + *   physical page!)
>> + * - Now switch to the runtime pgd (same VA, and still the same physical
>> + *   page!)
>> + * - Invalidate TLBs
>> + * - Set stack and vectors
>> + * - Profit! (or eret, if you only care about the code).
>> + *
>> + * As we only have four registers available to pass parameters (and we
>> + * need six), we split the init in two phases:
>> + * - Phase 1: r0,r1 contain the boot PGD, r2 = 0, r3 = 0.
>> + *   Provides the basic HYP init, and enable the MMU.
>> + * - Phase 2: r0,r1 contain the runtime PGD, r2 = ToS, r3 = vectors
>> + *   Switches to the runtime PGD, set stack and vectors.
>>   */
>>
>>         .text
>> @@ -47,6 +67,9 @@ __kvm_hyp_init:
>>         W(b)    .
>>
>>  __do_hyp_init:
>> +       cmp     r2, #0                  @ We have a SP?
>> +       bne     phase2                  @ Yes, second stage init
>> +
>>         @ Set the HTTBR to point to the hypervisor PGD pointer passed
>>         mcrr    p15, 4, r0, r1, c2
>>
>> @@ -96,14 +119,31 @@ __do_hyp_init:
>>         orr     r0, r0, r1
>>         isb
>>         mcr     p15, 4, r0, c1, c0, 0   @ HSCR
>> -       isb
>>
>> -       @ Set stack pointer and return to the kernel
>> +       @ End of init phase-1
>> +       eret
>> +
>> +phase2:
>> +       @ Set stack pointer
>>         mov     sp, r2
>>
>>         @ Set HVBAR to point to the HYP vectors
>>         mcr     p15, 4, r3, c12, c0, 0  @ HVBAR
>>
>> +       @ Jump to the trampoline page
>> +       ldr     r2, =TRAMPOLINE_VA
>> +       adr     r3, target
>> +       bfi     r2, r3, #0, #PAGE_SHIFT
>> +       mov     pc, r2
>> +
>> +target:        @ We're now in the trampoline code, switch page tables
>> +       mcrr    p15, 4, r0, r1, c2
>> +       isb
>> +
>> +       @ Invalidate the old TLBs
>> +       mcr     p15, 4, r0, c8, c7, 0   @ TLBIALLH
>> +       dsb
>> +
>>         eret
>>
>>         .ltorg
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index c4236e4..b20eff2 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -32,9 +32,15 @@
>>
>>  extern char  __hyp_idmap_text_start[], __hyp_idmap_text_end[];
>>
>> +static pgd_t *boot_hyp_pgd;
>>  static pgd_t *hyp_pgd;
>>  static DEFINE_MUTEX(kvm_hyp_pgd_mutex);
>>
>> +static void *init_bounce_page;
>> +static unsigned long hyp_idmap_start;
>> +static unsigned long hyp_idmap_end;
>> +static phys_addr_t hyp_idmap_vector;
>> +
>>  static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>>  {
>>         kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>> @@ -108,9 +114,12 @@ static void free_hyp_pgd_entry(pgd_t *pgdp, unsigned long addr)
>>  /**
>>   * free_hyp_pgds - free Hyp-mode page tables
>>   *
>> - * Assumes hyp_pgd is a page table used strictly in Hyp-mode and therefore contains
>> - * either mappings in the kernel memory area (above PAGE_OFFSET), or
>> - * device mappings in the vmalloc range (from VMALLOC_START to VMALLOC_END).
>> + * Assumes hyp_pgd is a page table used strictly in Hyp-mode and
>> + * therefore contains either mappings in the kernel memory area (above
>> + * PAGE_OFFSET), or device mappings in the vmalloc range (from
>> + * VMALLOC_START to VMALLOC_END).
>> + *
>> + * boot_hyp_pgd should only map two pages for the init code.
>>   */
>>  void free_hyp_pgds(void)
>>  {
>> @@ -118,6 +127,12 @@ void free_hyp_pgds(void)
>>
>>         mutex_lock(&kvm_hyp_pgd_mutex);
>>
>> +       if (boot_hyp_pgd) {
>> +               free_hyp_pgd_entry(boot_hyp_pgd, hyp_idmap_start);
>> +               free_hyp_pgd_entry(boot_hyp_pgd, TRAMPOLINE_VA);
>> +               kfree(boot_hyp_pgd);
>> +       }
>> +
>>         if (hyp_pgd) {
>>                 for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
>>                         free_hyp_pgd_entry(hyp_pgd, KERN_TO_HYP(addr));
>> @@ -126,6 +141,7 @@ void free_hyp_pgds(void)
>>                 kfree(hyp_pgd);
>>         }
>>
>> +       kfree(init_bounce_page);
>>         mutex_unlock(&kvm_hyp_pgd_mutex);
>>  }
>>
>> @@ -718,14 +734,62 @@ phys_addr_t kvm_mmu_get_httbr(void)
>>         return virt_to_phys(hyp_pgd);
>>  }
>>
>> +phys_addr_t kvm_mmu_get_boot_httbr(void)
>> +{
>> +       VM_BUG_ON(!virt_addr_valid(boot_hyp_pgd));
> 
> Seems a bit OTT -- it's always going to be in lowmem.

Indeed.

>> +       return virt_to_phys(boot_hyp_pgd);
>> +}
>> +
>> +phys_addr_t kvm_get_idmap_vector(void)
>> +{
>> +       return hyp_idmap_vector;
>> +}
>> +
>>  int kvm_mmu_init(void)
>>  {
>> -       unsigned long hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start);
>> -       unsigned long hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end);
>>         int err;
>>
>> +       hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start);
>> +       hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end);
>> +       hyp_idmap_vector = virt_to_phys(__kvm_hyp_init);
>> +
>> +       if ((hyp_idmap_start ^ hyp_idmap_end) & PAGE_MASK) {
>> +               /*
>> +                * Our init code is crossing a page boundary. Allocate
>> +                * a bounce page, copy the code over and use that.
>> +                */
>> +               size_t len = __hyp_idmap_text_end - __hyp_idmap_text_start;
>> +               phys_addr_t phys_base;
>> +
>> +               init_bounce_page = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> +               if (!init_bounce_page) {
>> +                       kvm_err("Couldn't allocate HYP init bounce page\n");
>> +                       err = -ENOMEM;
>> +                       goto out;
>> +               }
> 
> Given that you don't really need a lowmem page for the bounce page, this
> might be better expressed using alloc_page and kmap for the memcpy.

I'm a bit dubious about that. We have to make sure that the memory is
within the 4GB range, and the only flag I can spot for alloc_page is
GFP_DMA32, which is not exactly what we want, even if it may work.

And yes, we have a problem with platforms having *all* their memory
above 4GB.

> I also wonder whether or not you can eventually free the page and
> corresponding mappings if !CONFIG_HOTPLUG_CPU?

This is indeed possible, as we're sure nothing will use the boot tage
tables once all CPUs have switched to the runtime page tables.

>> +               memcpy(init_bounce_page, __hyp_idmap_text_start, len);
>> +               /*
>> +                * Warning: the code we just copied to the bounce page
>> +                * must be flushed to the point of coherency.
>> +                * Otherwise, the data may be sitting in L2, and HYP
>> +                * mode won't be able to observe it as it runs with
>> +                * caches off at that point.
>> +                */
>> +               kvm_flush_dcache_to_poc(init_bounce_page, len);
>> +
>> +               phys_base = virt_to_phys(init_bounce_page);
>> +               hyp_idmap_vector += phys_base - hyp_idmap_start;
>> +               hyp_idmap_start = phys_base;
>> +               hyp_idmap_end = phys_base + len;
>> +
>> +               kvm_info("Using HYP init bounce page @%lx\n",
>> +                        (unsigned long)phys_base);
>> +       }
>> +
>>         hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
>> -       if (!hyp_pgd) {
>> +       boot_hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
>> +       if (!hyp_pgd || !boot_hyp_pgd) {
>>                 kvm_err("Hyp mode PGD not allocated\n");
>>                 err = -ENOMEM;
>>                 goto out;
>> @@ -743,39 +807,30 @@ int kvm_mmu_init(void)
>>                 goto out;
>>         }
>>
>> +       /* Map the very same page at the trampoline VA */
>> +       err =   __create_hyp_mappings(boot_hyp_pgd,
>> +                                     TRAMPOLINE_VA, TRAMPOLINE_VA + PAGE_SIZE,
>> +                                     __phys_to_pfn(hyp_idmap_start),
>> +                                     PAGE_HYP);
>> +       if (err) {
>> +               kvm_err("Failed to map trampoline @%lx into boot HYP pgd\n",
>> +                       TRAMPOLINE_VA);
>> +               goto out;
>> +       }
>> +
>> +       /* Map the same page again into the runtime page tables */
>> +       err =   __create_hyp_mappings(hyp_pgd,
>> +                                     TRAMPOLINE_VA, TRAMPOLINE_VA + PAGE_SIZE,
>> +                                     __phys_to_pfn(hyp_idmap_start),
>> +                                     PAGE_HYP);
>> +       if (err) {
>> +               kvm_err("Failed to map trampoline @%lx into runtime HYP pgd\n",
>> +                       TRAMPOLINE_VA);
>> +               goto out;
>> +       }
> 
> I'm probably just missing it, but where are the updated page tables flushed
> to memory?

Mumble mumble... We have some partial flush, but clearly not enough of
it to be completely safe. Good spotting.

I'll update the series and send out a v3.

Thanks for reviewing.

	M.
-- 
Jazz is not dead. It just smells funny...