[PATCH v2 6/7] ARM: KVM: switch to a dual-step HYP init code
Christoffer Dall
cdall at cs.columbia.edu
Tue Apr 9 14:28:07 EDT 2013
On Tue, Apr 9, 2013 at 3:42 AM, Marc Zyngier <marc.zyngier at arm.com> wrote:
> On 09/04/13 10:18, Will Deacon wrote:
>> On Mon, Apr 08, 2013 at 04:36:42PM +0100, Marc Zyngier wrote:
>>> Our HYP init code suffers from two major design issues:
>>> - it cannot support CPU hotplug, as we tear down the idmap very early
>>> - it cannot perform a TLB invalidation when switching from init to
>>> runtime mappings, as pages are manipulated from PL1 exclusively
>>>
>>> The hotplug problem mandates that we keep two sets of page tables
>>> (boot and runtime). The TLB problem mandates that we're able to
>>> transition from one PGD to another while in HYP, invalidating the TLBs
>>> in the process.
>>>
>>> To be able to do this, we need to share a page between the two page
>>> tables. A page that will have the same VA in both configurations. All we
>>> need is a VA that has the following properties:
>>> - This VA can't be used to represent a kernel mapping.
>>> - This VA will not conflict with the physical address of the kernel text
>>>
>>> The vectors page seems to satisfy this requirement:
>>> - The kernel never maps anything else there
>>> - The kernel text being copied at the beginning of the physical memory,
>>> it is unlikely to use the last 64kB (I doubt we'll ever support KVM
>>> on a system with something like 4MB of RAM, but patches are very
>>> welcome).
>>>
>>> Let's call this VA the trampoline VA.
>>>
>>> Now, we map our init page at 3 locations:
>>> - idmap in the boot pgd
>>> - trampoline VA in the boot pgd
>>> - trampoline VA in the runtime pgd
>>>
>>> The init scenario is now the following:
>>> - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
>>> runtime stack, runtime vectors
>>> - Enable the MMU with the boot pgd
>>> - Jump to a target into the trampoline page (remember, this is the same
>>> physical page!)
>>> - Now switch to the runtime pgd (same VA, and still the same physical
>>> page!)
>>> - Invalidate TLBs
>>> - Set stack and vectors
>>> - Profit! (or eret, if you only care about the code).
>>>
>>> Note that we keep the boot mapping permanently (it is not strictly an
>>> idmap anymore) to allow for CPU hotplug in later patches.
>>>
>>> Signed-off-by: Marc Zyngier <marc.zyngier at arm.com>
>>> ---
>>> arch/arm/include/asm/kvm_host.h | 18 ++++--
>>> arch/arm/include/asm/kvm_mmu.h | 24 +++++++-
>>> arch/arm/kvm/arm.c | 11 ++--
>>> arch/arm/kvm/init.S | 44 +++++++++++++-
>>> arch/arm/kvm/mmu.c | 129 ++++++++++++++++++++++++++++------------
>>> 5 files changed, 173 insertions(+), 53 deletions(-)
>>>
>>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>>> index a856cc2..9fe22ee 100644
>>> --- a/arch/arm/include/asm/kvm_host.h
>>> +++ b/arch/arm/include/asm/kvm_host.h
>>> @@ -190,22 +190,32 @@ int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
>>> int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>> int exception_index);
>>>
>>> -static inline void __cpu_init_hyp_mode(unsigned long long pgd_ptr,
>>> +static inline void __cpu_init_hyp_mode(unsigned long long boot_pgd_ptr,
>>> + unsigned long long pgd_ptr,
>>> unsigned long hyp_stack_ptr,
>>> unsigned long vector_ptr)
>>> {
>>> unsigned long pgd_low, pgd_high;
>>>
>>> - pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
>>> - pgd_high = (pgd_ptr >> 32ULL);
>>> + pgd_low = (boot_pgd_ptr & ((1ULL << 32) - 1));
>>> + pgd_high = (boot_pgd_ptr >> 32ULL);
>>>
>>> /*
>>> * Call initialization code, and switch to the full blown
>>> * HYP code. The init code doesn't need to preserve these registers as
>>> - * r1-r3 and r12 are already callee save according to the AAPCS.
>>> + * r1-r3 and r12 are already callee saved according to the AAPCS.
>>> * Note that we slightly misuse the prototype by casing the pgd_low to
>>> * a void *.
>>> + *
>>> + * We don't have enough registers to perform the full init in one go.
>>> + * Install the boot PGD first, and then install the runtime PGD,
>>> + * stack pointer and vectors.
>>> */
>>> + kvm_call_hyp((void *)pgd_low, pgd_high, 0, 0);
>>> +
>>> + pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
>>> + pgd_high = (pgd_ptr >> 32ULL);
>>
>> It might be worth macro-ising the low/high extraction operations now that
>> you're using them twice. Hell, you could even make them big-endian clean!
>
> Now you're talking! ;-)
>
> I actually wonder if we could rearrange the code to let the compiler do
> the low/high split... Quickly looking through the PCS, it looks like
> this would actually work.
>
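For what it's worth, the macro version Will suggests could be as simple
as this (just a sketch, names made up, and not big-endian clean yet):

        /* Hypothetical helpers to split a 64-bit pointer for kvm_call_hyp() */
        #define PTR_LOW(x)      ((unsigned long)((x) & ((1ULL << 32) - 1)))
        #define PTR_HIGH(x)     ((unsigned long)((x) >> 32))

        kvm_call_hyp((void *)PTR_LOW(boot_pgd_ptr), PTR_HIGH(boot_pgd_ptr), 0, 0);
        kvm_call_hyp((void *)PTR_LOW(pgd_ptr), PTR_HIGH(pgd_ptr),
                     hyp_stack_ptr, vector_ptr);

Letting the compiler do the split by passing the pgd as a single 64-bit
argument would be even nicer, if the register pairing works out the way
Marc describes.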
>>> kvm_call_hyp((void *)pgd_low, pgd_high, hyp_stack_ptr, vector_ptr);
>>> }
>>>
>>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>>> index 92eb20d..24b767a 100644
>>> --- a/arch/arm/include/asm/kvm_mmu.h
>>> +++ b/arch/arm/include/asm/kvm_mmu.h
>>> @@ -19,17 +19,29 @@
>>> #ifndef __ARM_KVM_MMU_H__
>>> #define __ARM_KVM_MMU_H__
>>>
>>> -#include <asm/cacheflush.h>
>>> -#include <asm/pgalloc.h>
>>> +#include <asm/memory.h>
>>> +#include <asm/page.h>
>>>
>>> /*
>>> * We directly use the kernel VA for the HYP, as we can directly share
>>> * the mapping (HTTBR "covers" TTBR1).
>>> */
>>> -#define HYP_PAGE_OFFSET_MASK (~0UL)
>>> +#define HYP_PAGE_OFFSET_MASK UL(~0)
>>> #define HYP_PAGE_OFFSET PAGE_OFFSET
>>> #define KERN_TO_HYP(kva) (kva)
>>>
>>> +/*
>>> + * Our virtual mapping for the boot-time MMU-enable code. Must be
>>> + * shared across all the page-tables. Conveniently, we use the vectors
>>> + * page, where no kernel data will ever be shared with HYP.
>>> + */
>>> +#define TRAMPOLINE_VA UL(CONFIG_VECTORS_BASE)
>>> +
>>> +#ifndef __ASSEMBLY__
>>> +
>>> +#include <asm/cacheflush.h>
>>> +#include <asm/pgalloc.h>
>>> +
>>> int create_hyp_mappings(void *from, void *to);
>>> int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
>>> void free_hyp_pgds(void);
>>> @@ -44,6 +56,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>>> void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
>>>
>>> phys_addr_t kvm_mmu_get_httbr(void);
>>> +phys_addr_t kvm_mmu_get_boot_httbr(void);
>>> +phys_addr_t kvm_get_idmap_vector(void);
>>> int kvm_mmu_init(void);
>>> void kvm_clear_hyp_idmap(void);
>>>
>>> @@ -113,4 +127,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>> }
>>> }
>>>
>>> +#define kvm_flush_dcache_to_poc(a,l) __cpuc_flush_dcache_area((a), (l))
>>> +
>>> +#endif /* !__ASSEMBLY__ */
>>> +
>>> #endif /* __ARM_KVM_MMU_H__ */
>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>>> index 8d44ef4..9010a12 100644
>>> --- a/arch/arm/kvm/arm.c
>>> +++ b/arch/arm/kvm/arm.c
>>> @@ -796,20 +796,22 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>>
>>> static void cpu_init_hyp_mode(void *vector)
>>> {
>>> + unsigned long long boot_pgd_ptr;
>>> unsigned long long pgd_ptr;
>>> unsigned long hyp_stack_ptr;
>>> unsigned long stack_page;
>>> unsigned long vector_ptr;
>>>
>>> /* Switch from the HYP stub to our own HYP init vector */
>>> - __hyp_set_vectors((unsigned long)vector);
>>> + __hyp_set_vectors(kvm_get_idmap_vector());
>>>
>>> + boot_pgd_ptr = (unsigned long long)kvm_mmu_get_boot_httbr();
>>> pgd_ptr = (unsigned long long)kvm_mmu_get_httbr();
>>
>> Strictly speaking, could you avoid using two sets of tables for this? That
>> is, have code in the trampoline page which remaps the rest of the address
>> space using the current tables? Not saying it's feasible, just interested.
>
> If you do that, you lose the ability to bring in a hotplugged CPU, as
> your idmap and your runtime page tables may have conflicting
> translations. Or am I missing something obvious?
>
I don't think you're missing anything. There are two requirements: the
idmap and the context switch, and as long as we keep the same VA in Hyp
as in kernel mode (modulo the offset for arm64), I don't see any way
around having those two sets of tables.
>>> stack_page = __get_cpu_var(kvm_arm_hyp_stack_page);
>>> hyp_stack_ptr = stack_page + PAGE_SIZE;
>>> vector_ptr = (unsigned long)__kvm_hyp_vector;
>>>
>>> - __cpu_init_hyp_mode(pgd_ptr, hyp_stack_ptr, vector_ptr);
>>> + __cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
>>> }
>>>
>>> /**
>>> @@ -863,11 +865,6 @@ static int init_hyp_mode(void)
>>> }
>>>
>>> /*
>>> - * Unmap the identity mapping
>>> - */
>>> - kvm_clear_hyp_idmap();
>>> -
>>> - /*
>>> * Map the Hyp-code called directly from the host
>>> */
>>> err = create_hyp_mappings(__kvm_hyp_code_start, __kvm_hyp_code_end);
>>> diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
>>> index 9f37a79..0ca054f 100644
>>> --- a/arch/arm/kvm/init.S
>>> +++ b/arch/arm/kvm/init.S
>>> @@ -21,6 +21,7 @@
>>> #include <asm/asm-offsets.h>
>>> #include <asm/kvm_asm.h>
>>> #include <asm/kvm_arm.h>
>>> +#include <asm/kvm_mmu.h>
>>>
>>> /********************************************************************
>>> * Hypervisor initialization
>>> @@ -28,6 +29,25 @@
>>> * r0,r1 = Hypervisor pgd pointer
>>> * r2 = top of Hyp stack (kernel VA)
>>> * r3 = pointer to hyp vectors
>>> + *
>>> + * The init scenario is:
>>> + * - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
>>> + * runtime stack, runtime vectors
>>> + * - Enable the MMU with the boot pgd
>>> + * - Jump to a target into the trampoline page (remember, this is the same
>>> + * physical page!)
>>> + * - Now switch to the runtime pgd (same VA, and still the same physical
>>> + * page!)
>>> + * - Invalidate TLBs
>>> + * - Set stack and vectors
>>> + * - Profit! (or eret, if you only care about the code).
>>> + *
>>> + * As we only have four registers available to pass parameters (and we
>>> + * need six), we split the init in two phases:
>>> + * - Phase 1: r0,r1 contain the boot PGD, r2 = 0, r3 = 0.
>>> + * Provides the basic HYP init, and enable the MMU.
>>> + * - Phase 2: r0,r1 contain the runtime PGD, r2 = ToS, r3 = vectors
>>> + * Switches to the runtime PGD, set stack and vectors.
>>> */
>>>
>>> .text
>>> @@ -47,6 +67,9 @@ __kvm_hyp_init:
>>> W(b) .
>>>
>>> __do_hyp_init:
>>> + cmp r2, #0 @ We have a SP?
>>> + bne phase2 @ Yes, second stage init
>>> +
>>> @ Set the HTTBR to point to the hypervisor PGD pointer passed
>>> mcrr p15, 4, r0, r1, c2
>>>
>>> @@ -96,14 +119,31 @@ __do_hyp_init:
>>> orr r0, r0, r1
>>> isb
>>> mcr p15, 4, r0, c1, c0, 0 @ HSCR
>>> - isb
>>>
>>> - @ Set stack pointer and return to the kernel
>>> + @ End of init phase-1
>>> + eret
>>> +
>>> +phase2:
>>> + @ Set stack pointer
>>> mov sp, r2
>>>
>>> @ Set HVBAR to point to the HYP vectors
>>> mcr p15, 4, r3, c12, c0, 0 @ HVBAR
>>>
>>> + @ Jump to the trampoline page
>>> + ldr r2, =TRAMPOLINE_VA
>>> + adr r3, target
>>> + bfi r2, r3, #0, #PAGE_SHIFT
>>> + mov pc, r2
>>> +
>>> +target: @ We're now in the trampoline code, switch page tables
>>> + mcrr p15, 4, r0, r1, c2
>>> + isb
>>> +
>>> + @ Invalidate the old TLBs
>>> + mcr p15, 4, r0, c8, c7, 0 @ TLBIALLH
>>> + dsb
>>> +
>>> eret
>>>
>>> .ltorg
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index c4236e4..b20eff2 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -32,9 +32,15 @@
>>>
>>> extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
>>>
>>> +static pgd_t *boot_hyp_pgd;
>>> static pgd_t *hyp_pgd;
>>> static DEFINE_MUTEX(kvm_hyp_pgd_mutex);
>>>
>>> +static void *init_bounce_page;
>>> +static unsigned long hyp_idmap_start;
>>> +static unsigned long hyp_idmap_end;
>>> +static phys_addr_t hyp_idmap_vector;
>>> +
>>> static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>>> {
>>> kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>>> @@ -108,9 +114,12 @@ static void free_hyp_pgd_entry(pgd_t *pgdp, unsigned long addr)
>>> /**
>>> * free_hyp_pgds - free Hyp-mode page tables
>>> *
>>> - * Assumes hyp_pgd is a page table used strictly in Hyp-mode and therefore contains
>>> - * either mappings in the kernel memory area (above PAGE_OFFSET), or
>>> - * device mappings in the vmalloc range (from VMALLOC_START to VMALLOC_END).
>>> + * Assumes hyp_pgd is a page table used strictly in Hyp-mode and
>>> + * therefore contains either mappings in the kernel memory area (above
>>> + * PAGE_OFFSET), or device mappings in the vmalloc range (from
>>> + * VMALLOC_START to VMALLOC_END).
>>> + *
>>> + * boot_hyp_pgd should only map two pages for the init code.
>>> */
>>> void free_hyp_pgds(void)
>>> {
>>> @@ -118,6 +127,12 @@ void free_hyp_pgds(void)
>>>
>>> mutex_lock(&kvm_hyp_pgd_mutex);
>>>
>>> + if (boot_hyp_pgd) {
>>> + free_hyp_pgd_entry(boot_hyp_pgd, hyp_idmap_start);
>>> + free_hyp_pgd_entry(boot_hyp_pgd, TRAMPOLINE_VA);
>>> + kfree(boot_hyp_pgd);
>>> + }
>>> +
>>> if (hyp_pgd) {
>>> for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
>>> free_hyp_pgd_entry(hyp_pgd, KERN_TO_HYP(addr));
>>> @@ -126,6 +141,7 @@ void free_hyp_pgds(void)
>>> kfree(hyp_pgd);
>>> }
>>>
>>> + kfree(init_bounce_page);
>>> mutex_unlock(&kvm_hyp_pgd_mutex);
>>> }
>>>
>>> @@ -718,14 +734,62 @@ phys_addr_t kvm_mmu_get_httbr(void)
>>> return virt_to_phys(hyp_pgd);
>>> }
>>>
>>> +phys_addr_t kvm_mmu_get_boot_httbr(void)
>>> +{
>>> + VM_BUG_ON(!virt_addr_valid(boot_hyp_pgd));
>>
>> Seems a bit OTT -- it's always going to be in lowmem.
>
> Indeed.
>
>>> + return virt_to_phys(boot_hyp_pgd);
>>> +}
>>> +
>>> +phys_addr_t kvm_get_idmap_vector(void)
>>> +{
>>> + return hyp_idmap_vector;
>>> +}
>>> +
>>> int kvm_mmu_init(void)
>>> {
>>> - unsigned long hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start);
>>> - unsigned long hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end);
>>> int err;
>>>
>>> + hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start);
>>> + hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end);
>>> + hyp_idmap_vector = virt_to_phys(__kvm_hyp_init);
>>> +
>>> + if ((hyp_idmap_start ^ hyp_idmap_end) & PAGE_MASK) {
>>> + /*
>>> + * Our init code is crossing a page boundary. Allocate
>>> + * a bounce page, copy the code over and use that.
>>> + */
>>> + size_t len = __hyp_idmap_text_end - __hyp_idmap_text_start;
>>> + phys_addr_t phys_base;
>>> +
>>> + init_bounce_page = kzalloc(PAGE_SIZE, GFP_KERNEL);
>>> + if (!init_bounce_page) {
>>> + kvm_err("Couldn't allocate HYP init bounce page\n");
>>> + err = -ENOMEM;
>>> + goto out;
>>> + }
>>
>> Given that you don't really need a lowmem page for the bounce page, this
>> might be better expressed using alloc_page and kmap for the memcpy.
>
> I'm a bit dubious about that. We have to make sure that the memory is
> within the 4GB range, and the only flag I can spot for alloc_page is
> GFP_DMA32, which is not exactly what we want, even if it may work.
>
> And yes, we have a problem with platforms having *all* their memory
> above 4GB.
>
Now that we're picking at this: do we really need to memset an entire
page to zero? I know it's nice for debugging, but it's unnecessary and
slows down boot ever so slightly, no?
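Something like this would avoid both the zeroing and the kzalloc
(untested sketch; it assumes __get_free_page() hands back a suitably
placed lowmem page -- modulo Marc's point about platforms with all
their memory above 4GB -- and free_hyp_pgds() would then need
free_page() instead of kfree()):

        /* Grab a lowmem page without zeroing it; the interesting part
         * is overwritten with the init code right away. */
        init_bounce_page = (void *)__get_free_page(GFP_KERNEL);
        if (!init_bounce_page) {
                kvm_err("Couldn't allocate HYP init bounce page\n");
                err = -ENOMEM;
                goto out;
        }
        memcpy(init_bounce_page, __hyp_idmap_text_start, len);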
>> I also wonder whether or not you can eventually free the page and
>> corresponding mappings if !CONFIG_HOTPLUG_CPU?
>
> This is indeed possible, as we're sure nothing will use the boot page
> tables once all CPUs have switched to the runtime page tables.
>
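If we do end up freeing them, a helper built from the same pieces as
free_hyp_pgds() could do it -- rough sketch only, and free_boot_hyp_pgd()
is a name I just made up:

        /* Tear down the boot tables and the bounce page once every CPU
         * has switched to the runtime tables.  Only safe when we will
         * never need to bring a CPU back up (!CONFIG_HOTPLUG_CPU). */
        static void free_boot_hyp_pgd(void)
        {
                mutex_lock(&kvm_hyp_pgd_mutex);

                if (boot_hyp_pgd) {
                        free_hyp_pgd_entry(boot_hyp_pgd, hyp_idmap_start);
                        free_hyp_pgd_entry(boot_hyp_pgd, TRAMPOLINE_VA);
                        kfree(boot_hyp_pgd);
                        boot_hyp_pgd = NULL;
                }

                kfree(init_bounce_page);
                init_bounce_page = NULL;

                mutex_unlock(&kvm_hyp_pgd_mutex);
        }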
>>> + memcpy(init_bounce_page, __hyp_idmap_text_start, len);
>>> + /*
>>> + * Warning: the code we just copied to the bounce page
>>> + * must be flushed to the point of coherency.
>>> + * Otherwise, the data may be sitting in L2, and HYP
>>> + * mode won't be able to observe it as it runs with
>>> + * caches off at that point.
>>> + */
>>> + kvm_flush_dcache_to_poc(init_bounce_page, len);
>>> +
>>> + phys_base = virt_to_phys(init_bounce_page);
>>> + hyp_idmap_vector += phys_base - hyp_idmap_start;
>>> + hyp_idmap_start = phys_base;
>>> + hyp_idmap_end = phys_base + len;
>>> +
>>> + kvm_info("Using HYP init bounce page @%lx\n",
>>> + (unsigned long)phys_base);
>>> + }
>>> +
>>> hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
>>> - if (!hyp_pgd) {
>>> + boot_hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
>>> + if (!hyp_pgd || !boot_hyp_pgd) {
>>> kvm_err("Hyp mode PGD not allocated\n");
>>> err = -ENOMEM;
>>> goto out;
>>> @@ -743,39 +807,30 @@ int kvm_mmu_init(void)
>>> goto out;
>>> }
>>>
>>> + /* Map the very same page at the trampoline VA */
>>> + err = __create_hyp_mappings(boot_hyp_pgd,
>>> + TRAMPOLINE_VA, TRAMPOLINE_VA + PAGE_SIZE,
>>> + __phys_to_pfn(hyp_idmap_start),
>>> + PAGE_HYP);
>>> + if (err) {
>>> + kvm_err("Failed to map trampoline @%lx into boot HYP pgd\n",
>>> + TRAMPOLINE_VA);
>>> + goto out;
>>> + }
>>> +
>>> + /* Map the same page again into the runtime page tables */
>>> + err = __create_hyp_mappings(hyp_pgd,
>>> + TRAMPOLINE_VA, TRAMPOLINE_VA + PAGE_SIZE,
>>> + __phys_to_pfn(hyp_idmap_start),
>>> + PAGE_HYP);
>>> + if (err) {
>>> + kvm_err("Failed to map trampoline @%lx into runtime HYP pgd\n",
>>> + TRAMPOLINE_VA);
>>> + goto out;
>>> + }
>>
>> I'm probably just missing it, but where are the updated page tables flushed
>> to memory?
>
> Mumble mumble... We have some partial flush, but clearly not enough of
> it to be completely safe. Good spotting.
>
> I'll update the series and send out a v3.
>
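For the missing flush, I guess something as simple as cleaning the
top-level tables once all the mappings are in place might do (sketch
only; the lower-level tables allocated by __create_hyp_mappings() would
need the same treatment):

        /* HYP starts out with the MMU and caches off, so the tables we
         * just wrote from PL1 must be visible at the point of coherency. */
        kvm_flush_dcache_to_poc(boot_hyp_pgd, PTRS_PER_PGD * sizeof(pgd_t));
        kvm_flush_dcache_to_poc(hyp_pgd, PTRS_PER_PGD * sizeof(pgd_t));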
> Thanks for reviewing.
>
> M.
> --
> Jazz is not dead. It just smells funny...
>