[PATCH v2 6/7] ARM: KVM: switch to a dual-step HYP init code
Marc Zyngier
marc.zyngier at arm.com
Tue Apr 9 06:42:56 EDT 2013
On 09/04/13 10:18, Will Deacon wrote:
> On Mon, Apr 08, 2013 at 04:36:42PM +0100, Marc Zyngier wrote:
>> Our HYP init code suffers from two major design issues:
>> - it cannot support CPU hotplug, as we tear down the idmap very early
>> - it cannot perform a TLB invalidation when switching from init to
>> runtime mappings, as pages are manipulated from PL1 exclusively
>>
>> The hotplug problem mandates that we keep two sets of page tables
>> (boot and runtime). The TLB problem mandates that we're able to
>> transition from one PGD to another while in HYP, invalidating the TLBs
>> in the process.
>>
>> To be able to do this, we need to share a page between the two page
>> tables. A page that will have the same VA in both configurations. All we
>> need is a VA that has the following properties:
>> - This VA can't be used to represent a kernel mapping.
>> - This VA will not conflict with the physical address of the kernel text
>>
>> The vectors page seems to satisfy this requirement:
>> - The kernel never maps anything else there
>> - As the kernel text is copied to the beginning of physical memory,
>> it is unlikely to use the last 64kB (I doubt we'll ever support KVM
>> on a system with something like 4MB of RAM, but patches are very
>> welcome).
>>
>> Let's call this VA the trampoline VA.
>>
>> Now, we map our init page at 3 locations:
>> - idmap in the boot pgd
>> - trampoline VA in the boot pgd
>> - trampoline VA in the runtime pgd
>>
>> The init scenario is now the following:
>> - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
>> runtime stack, runtime vectors
>> - Enable the MMU with the boot pgd
>> - Jump to a target into the trampoline page (remember, this is the same
>> physical page!)
>> - Now switch to the runtime pgd (same VA, and still the same physical
>> page!)
>> - Invalidate TLBs
>> - Set stack and vectors
>> - Profit! (or eret, if you only care about the code).
>>
>> Note that we keep the boot mapping permanently (it is not strictly an
>> idmap anymore) to allow for CPU hotplug in later patches.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier at arm.com>
>> ---
>> arch/arm/include/asm/kvm_host.h | 18 ++++--
>> arch/arm/include/asm/kvm_mmu.h | 24 +++++++-
>> arch/arm/kvm/arm.c | 11 ++--
>> arch/arm/kvm/init.S | 44 +++++++++++++-
>> arch/arm/kvm/mmu.c | 129 ++++++++++++++++++++++++++++------------
>> 5 files changed, 173 insertions(+), 53 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index a856cc2..9fe22ee 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -190,22 +190,32 @@ int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
>> int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>> int exception_index);
>>
>> -static inline void __cpu_init_hyp_mode(unsigned long long pgd_ptr,
>> +static inline void __cpu_init_hyp_mode(unsigned long long boot_pgd_ptr,
>> + unsigned long long pgd_ptr,
>> unsigned long hyp_stack_ptr,
>> unsigned long vector_ptr)
>> {
>> unsigned long pgd_low, pgd_high;
>>
>> - pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
>> - pgd_high = (pgd_ptr >> 32ULL);
>> + pgd_low = (boot_pgd_ptr & ((1ULL << 32) - 1));
>> + pgd_high = (boot_pgd_ptr >> 32ULL);
>>
>> /*
>> * Call initialization code, and switch to the full blown
>> * HYP code. The init code doesn't need to preserve these registers as
>> - * r1-r3 and r12 are already callee save according to the AAPCS.
>> + * r1-r3 and r12 are already callee saved according to the AAPCS.
>> * Note that we slightly misuse the prototype by casing the pgd_low to
>> * a void *.
>> + *
>> + * We don't have enough registers to perform the full init in one go.
>> + * Install the boot PGD first, and then install the runtime PGD,
>> + * stack pointer and vectors.
>> */
>> + kvm_call_hyp((void *)pgd_low, pgd_high, 0, 0);
>> +
>> + pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
>> + pgd_high = (pgd_ptr >> 32ULL);
>
> It might be worth macro-ising the low/high extraction operations now that
> you're using them twice. Hell, you could even make them big-endian clean!
Now you're talking! ;-)
I actually wonder if we could rearrange the code to let the compiler do
the low/high split... Quickly looking through the PCS, it looks like
this would actually work.
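Something like this would do for the extraction in the meantime (untested
sketch, and the macro names are made up):

	#define PGD_LOW(x)	((unsigned long)((x) & ((1ULL << 32) - 1)))
	#define PGD_HIGH(x)	((unsigned long)((x) >> 32))

	pgd_low  = PGD_LOW(boot_pgd_ptr);
	pgd_high = PGD_HIGH(boot_pgd_ptr);

But if the PCS trick pans out, the manual split disappears altogether.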
>> kvm_call_hyp((void *)pgd_low, pgd_high, hyp_stack_ptr, vector_ptr);
>> }
>>
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 92eb20d..24b767a 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -19,17 +19,29 @@
>> #ifndef __ARM_KVM_MMU_H__
>> #define __ARM_KVM_MMU_H__
>>
>> -#include <asm/cacheflush.h>
>> -#include <asm/pgalloc.h>
>> +#include <asm/memory.h>
>> +#include <asm/page.h>
>>
>> /*
>> * We directly use the kernel VA for the HYP, as we can directly share
>> * the mapping (HTTBR "covers" TTBR1).
>> */
>> -#define HYP_PAGE_OFFSET_MASK (~0UL)
>> +#define HYP_PAGE_OFFSET_MASK UL(~0)
>> #define HYP_PAGE_OFFSET PAGE_OFFSET
>> #define KERN_TO_HYP(kva) (kva)
>>
>> +/*
>> + * Our virtual mapping for the boot-time MMU-enable code. Must be
>> + * shared across all the page-tables. Conveniently, we use the vectors
>> + * page, where no kernel data will ever be shared with HYP.
>> + */
>> +#define TRAMPOLINE_VA UL(CONFIG_VECTORS_BASE)
>> +
>> +#ifndef __ASSEMBLY__
>> +
>> +#include <asm/cacheflush.h>
>> +#include <asm/pgalloc.h>
>> +
>> int create_hyp_mappings(void *from, void *to);
>> int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
>> void free_hyp_pgds(void);
>> @@ -44,6 +56,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>> void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
>>
>> phys_addr_t kvm_mmu_get_httbr(void);
>> +phys_addr_t kvm_mmu_get_boot_httbr(void);
>> +phys_addr_t kvm_get_idmap_vector(void);
>> int kvm_mmu_init(void);
>> void kvm_clear_hyp_idmap(void);
>>
>> @@ -113,4 +127,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>> }
>> }
>>
>> +#define kvm_flush_dcache_to_poc(a,l) __cpuc_flush_dcache_area((a), (l))
>> +
>> +#endif /* !__ASSEMBLY__ */
>> +
>> #endif /* __ARM_KVM_MMU_H__ */
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 8d44ef4..9010a12 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -796,20 +796,22 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>
>> static void cpu_init_hyp_mode(void *vector)
>> {
>> + unsigned long long boot_pgd_ptr;
>> unsigned long long pgd_ptr;
>> unsigned long hyp_stack_ptr;
>> unsigned long stack_page;
>> unsigned long vector_ptr;
>>
>> /* Switch from the HYP stub to our own HYP init vector */
>> - __hyp_set_vectors((unsigned long)vector);
>> + __hyp_set_vectors(kvm_get_idmap_vector());
>>
>> + boot_pgd_ptr = (unsigned long long)kvm_mmu_get_boot_httbr();
>> pgd_ptr = (unsigned long long)kvm_mmu_get_httbr();
>
> Strictly speaking, could you avoid using two sets of tables for this? That
> is, have code in the trampoline page which remaps the rest of the address
> space using the current tables? Not saying it's feasible, just interested.
If you do that, you lose the ability to bring in a hotplugged CPU, as
your idmap and your runtime page tables may have conflicting
translations. Or am I missing something obvious?
>> stack_page = __get_cpu_var(kvm_arm_hyp_stack_page);
>> hyp_stack_ptr = stack_page + PAGE_SIZE;
>> vector_ptr = (unsigned long)__kvm_hyp_vector;
>>
>> - __cpu_init_hyp_mode(pgd_ptr, hyp_stack_ptr, vector_ptr);
>> + __cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
>> }
>>
>> /**
>> @@ -863,11 +865,6 @@ static int init_hyp_mode(void)
>> }
>>
>> /*
>> - * Unmap the identity mapping
>> - */
>> - kvm_clear_hyp_idmap();
>> -
>> - /*
>> * Map the Hyp-code called directly from the host
>> */
>> err = create_hyp_mappings(__kvm_hyp_code_start, __kvm_hyp_code_end);
>> diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
>> index 9f37a79..0ca054f 100644
>> --- a/arch/arm/kvm/init.S
>> +++ b/arch/arm/kvm/init.S
>> @@ -21,6 +21,7 @@
>> #include <asm/asm-offsets.h>
>> #include <asm/kvm_asm.h>
>> #include <asm/kvm_arm.h>
>> +#include <asm/kvm_mmu.h>
>>
>> /********************************************************************
>> * Hypervisor initialization
>> @@ -28,6 +29,25 @@
>> * r0,r1 = Hypervisor pgd pointer
>> * r2 = top of Hyp stack (kernel VA)
>> * r3 = pointer to hyp vectors
>> + *
>> + * The init scenario is:
>> + * - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
>> + * runtime stack, runtime vectors
>> + * - Enable the MMU with the boot pgd
>> + * - Jump to a target into the trampoline page (remember, this is the same
>> + * physical page!)
>> + * - Now switch to the runtime pgd (same VA, and still the same physical
>> + * page!)
>> + * - Invalidate TLBs
>> + * - Set stack and vectors
>> + * - Profit! (or eret, if you only care about the code).
>> + *
>> + * As we only have four registers available to pass parameters (and we
>> + * need six), we split the init in two phases:
>> + * - Phase 1: r0,r1 contain the boot PGD, r2 = 0, r3 = 0.
>> + * Provides the basic HYP init, and enable the MMU.
>> + * - Phase 2: r0,r1 contain the runtime PGD, r2 = ToS, r3 = vectors
>> + * Switches to the runtime PGD, set stack and vectors.
>> */
>>
>> .text
>> @@ -47,6 +67,9 @@ __kvm_hyp_init:
>> W(b) .
>>
>> __do_hyp_init:
>> + cmp r2, #0 @ We have a SP?
>> + bne phase2 @ Yes, second stage init
>> +
>> @ Set the HTTBR to point to the hypervisor PGD pointer passed
>> mcrr p15, 4, r0, r1, c2
>>
>> @@ -96,14 +119,31 @@ __do_hyp_init:
>> orr r0, r0, r1
>> isb
>> mcr p15, 4, r0, c1, c0, 0 @ HSCR
>> - isb
>>
>> - @ Set stack pointer and return to the kernel
>> + @ End of init phase-1
>> + eret
>> +
>> +phase2:
>> + @ Set stack pointer
>> mov sp, r2
>>
>> @ Set HVBAR to point to the HYP vectors
>> mcr p15, 4, r3, c12, c0, 0 @ HVBAR
>>
>> + @ Jump to the trampoline page
>> + ldr r2, =TRAMPOLINE_VA
>> + adr r3, target
>> + bfi r2, r3, #0, #PAGE_SHIFT
>> + mov pc, r2
>> +
>> +target: @ We're now in the trampoline code, switch page tables
>> + mcrr p15, 4, r0, r1, c2
>> + isb
>> +
>> + @ Invalidate the old TLBs
>> + mcr p15, 4, r0, c8, c7, 0 @ TLBIALLH
>> + dsb
>> +
>> eret
>>
>> .ltorg
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index c4236e4..b20eff2 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -32,9 +32,15 @@
>>
>> extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
>>
>> +static pgd_t *boot_hyp_pgd;
>> static pgd_t *hyp_pgd;
>> static DEFINE_MUTEX(kvm_hyp_pgd_mutex);
>>
>> +static void *init_bounce_page;
>> +static unsigned long hyp_idmap_start;
>> +static unsigned long hyp_idmap_end;
>> +static phys_addr_t hyp_idmap_vector;
>> +
>> static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>> {
>> kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>> @@ -108,9 +114,12 @@ static void free_hyp_pgd_entry(pgd_t *pgdp, unsigned long addr)
>> /**
>> * free_hyp_pgds - free Hyp-mode page tables
>> *
>> - * Assumes hyp_pgd is a page table used strictly in Hyp-mode and therefore contains
>> - * either mappings in the kernel memory area (above PAGE_OFFSET), or
>> - * device mappings in the vmalloc range (from VMALLOC_START to VMALLOC_END).
>> + * Assumes hyp_pgd is a page table used strictly in Hyp-mode and
>> + * therefore contains either mappings in the kernel memory area (above
>> + * PAGE_OFFSET), or device mappings in the vmalloc range (from
>> + * VMALLOC_START to VMALLOC_END).
>> + *
>> + * boot_hyp_pgd should only map two pages for the init code.
>> */
>> void free_hyp_pgds(void)
>> {
>> @@ -118,6 +127,12 @@ void free_hyp_pgds(void)
>>
>> mutex_lock(&kvm_hyp_pgd_mutex);
>>
>> + if (boot_hyp_pgd) {
>> + free_hyp_pgd_entry(boot_hyp_pgd, hyp_idmap_start);
>> + free_hyp_pgd_entry(boot_hyp_pgd, TRAMPOLINE_VA);
>> + kfree(boot_hyp_pgd);
>> + }
>> +
>> if (hyp_pgd) {
>> for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
>> free_hyp_pgd_entry(hyp_pgd, KERN_TO_HYP(addr));
>> @@ -126,6 +141,7 @@ void free_hyp_pgds(void)
>> kfree(hyp_pgd);
>> }
>>
>> + kfree(init_bounce_page);
>> mutex_unlock(&kvm_hyp_pgd_mutex);
>> }
>>
>> @@ -718,14 +734,62 @@ phys_addr_t kvm_mmu_get_httbr(void)
>> return virt_to_phys(hyp_pgd);
>> }
>>
>> +phys_addr_t kvm_mmu_get_boot_httbr(void)
>> +{
>> + VM_BUG_ON(!virt_addr_valid(boot_hyp_pgd));
>
> Seems a bit OTT -- it's always going to be in lowmem.
Indeed.
>> + return virt_to_phys(boot_hyp_pgd);
>> +}
>> +
>> +phys_addr_t kvm_get_idmap_vector(void)
>> +{
>> + return hyp_idmap_vector;
>> +}
>> +
>> int kvm_mmu_init(void)
>> {
>> - unsigned long hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start);
>> - unsigned long hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end);
>> int err;
>>
>> + hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start);
>> + hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end);
>> + hyp_idmap_vector = virt_to_phys(__kvm_hyp_init);
>> +
>> + if ((hyp_idmap_start ^ hyp_idmap_end) & PAGE_MASK) {
>> + /*
>> + * Our init code is crossing a page boundary. Allocate
>> + * a bounce page, copy the code over and use that.
>> + */
>> + size_t len = __hyp_idmap_text_end - __hyp_idmap_text_start;
>> + phys_addr_t phys_base;
>> +
>> + init_bounce_page = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> + if (!init_bounce_page) {
>> + kvm_err("Couldn't allocate HYP init bounce page\n");
>> + err = -ENOMEM;
>> + goto out;
>> + }
>
> Given that you don't really need a lowmem page for the bounce page, this
> might be better expressed using alloc_page and kmap for the memcpy.
I'm a bit dubious about that. We have to make sure that the memory is
within the 4GB range, and the only flag I can spot for alloc_page is
GFP_DMA32, which is not exactly what we want, even if it may work.
And yes, we have a problem with platforms having *all* their memory
above 4GB.
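For reference, the kmap() variant would look roughly like this (untested,
and note that nothing below guarantees the page actually sits under 4GB,
which is the whole problem):

	struct page *page = alloc_page(GFP_KERNEL);
	void *va;

	if (!page) {
		err = -ENOMEM;
		goto out;
	}
	va = kmap(page);
	memcpy(va, __hyp_idmap_text_start, len);
	kvm_flush_dcache_to_poc(va, len);
	kunmap(page);
	/* ...and virt_to_phys() doesn't work on a highmem page, so the
	 * idmap address would have to come from page_to_phys(page). */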
> I also wonder whether or not you can eventually free the page and
> corresponding mappings if !CONFIG_HOTPLUG_CPU?
This is indeed possible, as we're sure nothing will use the boot page
tables once all CPUs have switched to the runtime page tables.
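Something along these lines at the end of the init sequence, maybe
(sketch only -- free_boot_hyp_pgd() is a made-up helper, not part of this
series):

	#ifndef CONFIG_HOTPLUG_CPU
		/*
		 * No CPU will ever come back through the boot tables, so
		 * the boot PGD, the trampoline mapping and the bounce page
		 * can all be dropped once every CPU has made the switch.
		 */
		free_boot_hyp_pgd();
	#endif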
>> + memcpy(init_bounce_page, __hyp_idmap_text_start, len);
>> + /*
>> + * Warning: the code we just copied to the bounce page
>> + * must be flushed to the point of coherency.
>> + * Otherwise, the data may be sitting in L2, and HYP
>> + * mode won't be able to observe it as it runs with
>> + * caches off at that point.
>> + */
>> + kvm_flush_dcache_to_poc(init_bounce_page, len);
>> +
>> + phys_base = virt_to_phys(init_bounce_page);
>> + hyp_idmap_vector += phys_base - hyp_idmap_start;
>> + hyp_idmap_start = phys_base;
>> + hyp_idmap_end = phys_base + len;
>> +
>> + kvm_info("Using HYP init bounce page @%lx\n",
>> + (unsigned long)phys_base);
>> + }
>> +
>> hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
>> - if (!hyp_pgd) {
>> + boot_hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
>> + if (!hyp_pgd || !boot_hyp_pgd) {
>> kvm_err("Hyp mode PGD not allocated\n");
>> err = -ENOMEM;
>> goto out;
>> @@ -743,39 +807,30 @@ int kvm_mmu_init(void)
>> goto out;
>> }
>>
>> + /* Map the very same page at the trampoline VA */
>> + err = __create_hyp_mappings(boot_hyp_pgd,
>> + TRAMPOLINE_VA, TRAMPOLINE_VA + PAGE_SIZE,
>> + __phys_to_pfn(hyp_idmap_start),
>> + PAGE_HYP);
>> + if (err) {
>> + kvm_err("Failed to map trampoline @%lx into boot HYP pgd\n",
>> + TRAMPOLINE_VA);
>> + goto out;
>> + }
>> +
>> + /* Map the same page again into the runtime page tables */
>> + err = __create_hyp_mappings(hyp_pgd,
>> + TRAMPOLINE_VA, TRAMPOLINE_VA + PAGE_SIZE,
>> + __phys_to_pfn(hyp_idmap_start),
>> + PAGE_HYP);
>> + if (err) {
>> + kvm_err("Failed to map trampoline @%lx into runtime HYP pgd\n",
>> + TRAMPOLINE_VA);
>> + goto out;
>> + }
>
> I'm probably just missing it, but where are the updated page tables flushed
> to memory?
Mumble mumble... We have some partial flushing, but clearly not enough to
be completely safe. Good spotting.
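Something like the following after the __create_hyp_mappings() calls in
kvm_mmu_init() should cover the top level (untested), with the same
treatment applied to the lower-level tables as they get allocated:

	/*
	 * Make sure the updated tables are visible to the HYP page table
	 * walker, which may not be looking in our caches at this point.
	 */
	kvm_flush_dcache_to_poc(boot_hyp_pgd, PTRS_PER_PGD * sizeof(pgd_t));
	kvm_flush_dcache_to_poc(hyp_pgd, PTRS_PER_PGD * sizeof(pgd_t));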
I'll update the series and send out a v3.
Thanks for reviewing.
M.
--
Jazz is not dead. It just smells funny...