[PATCH v12 04/16] arm64: kvm: allows kvm cpu hotplug

Tue Dec 15 01:51:03 PST 2015

On 12/15/2015 05:45 PM, Marc Zyngier wrote:
> On 15/12/15 07:51, AKASHI Takahiro wrote:
>> On 12/15/2015 02:33 AM, Marc Zyngier wrote:
>>> On 14/12/15 07:33, AKASHI Takahiro wrote:
>>>> Marc,
>>>>
>>>> On 12/12/2015 01:28 AM, Marc Zyngier wrote:
>>>>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>>>>> Ashwin, Marc,
>>>>>>
>>>>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff at infradead.org> wrote:
>>>>>>>>> From: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>>>>>>>
>>>>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>>>>> at system boot, and has no way to gracefully shut down a core in terms of
>>>>>>>>> kvm. In particular, this prevents kexec from rebooting the system on a boot
>>>>>>>>> core in EL2.
>>>>>>>>>
>>>>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>>>>
>>>>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>>>>> compiling errors.
>>>>>>>>>
>>>>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>>>>>>> ---
>>>>>>>>>      arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>>>>      arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>>>>      arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>>>>      arch/arm/kvm/mmu.c                |  5 +++
>>>>>>>>>      arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>>>>      arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>>>>      arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>>>>      arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>>>>      arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>>>>      9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>>>>
>>>>>>>> [..]
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>>>>             }
>>>>>>>>>
>>>>>>>>>             /*
>>>>>>>>> -        * Execute the init code on each CPU.
>>>>>>>>> -        */
>>>>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>>>>> -
>>>>>>>>> -       /*
>>>>>>>>>              * Init HYP view of VGIC
>>>>>>>>>              */
>>>>>>>>>             err = kvm_vgic_hyp_init();
>>>>>>>>
>>>>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>>>>> anymore.
>>>>>>
>>>>>> Thank you for pointing this out.
>>>>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>>>>> I didn't notice this problem.
>>>>>
>>>>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>>>>> GICv3 uses some system registers that are only available at EL2, and KVM
>>>>> needs some information contained in these registers before being able to
>>>>> get initialized.
>>>>
>>>> I see.
>>>>
>>>>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>>>>> model). I wish this kind of detail had been checked earlier, but thanks
>>>>>>> for pointing it out.
>>>>>>>
>>>>>>>> What's the best way to fix this?
>>>>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>>>>
>>>>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>>>>> while vgic_hyp_init() can only be called once. Also,
>>>>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>>>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>>>>>> there.
>>>>>>
>>>>>> Do you think so?
>>>>>> How about the fixup! patch attached below?
>>>>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>>>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>>>>> kvm cpu hotplug will still continue to work as before.
>>>>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>>>>> original code, the change will not be a big jump.
>>>>>
>>>>> This seems quite complicated:
>>>>> - init EL2 on  all CPUs
>>>>> - do some initialization
>>>>> - tear all CPUs EL2 down
>>>>> - let KVM drive the vectors being set or not
>>>>>
>>>>> My questions are: why do we need to do this on *all* cpus? Can't that
>>>>> work on a single one?
>>>>
>>>> I did initialize all the cpus partly because using preempt_enable/disable
>>>> looked a bit ugly and partly because we may, in the future, do additional
>>>> per-cpu initialization in kvm_vgic_hyp_init() and/or kvm_timer_hyp_init().
>>>> But if you're comfortable with preempt_*() stuff, I don't care.
>>>>
>>>>
>>>>> Also, the simple fact that we were able to get some junk value is a sign
>>>>> that something is amiss. I'd expect a splat of some sort, because we now
>>>>> have a possibility of doing things in the wrong context.
>>>>>
>>>>>>
>>>>>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is the *problem*,
>>>>>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>>>>>> we could run a kvm guest and also successfully execute kexec on a model w/gic-v3.
>>>>>>
>>>>>> My only concern is the following kernel message I saw when kexec shut down
>>>>>> the kernel:
>>>>>> (Please note that I was running one kvm guest (pid=961) here.)
>>>>>>
>>>>>> ===
>>>>>> sh-4.3# ./kexec -d -e
>>>>>> kexec version: 15.11.16.11.06-g41e52e2
>>>>>> arch_process_options:112: command_line: (null)
>>>>>> arch_process_options:114: initrd: (null)
>>>>>> arch_process_options:115: dtb: (null)
>>>>>> arch_process_options:117: port: 0x0
>>>>>> kvm: exiting hardware virtualization
>>>>>> kvm [961]: Unsupported exception type: 6248304    <== this message
>>>>>
>>>>> That makes me feel very uncomfortable. It looks like we've exited a
>>>>> guest with some horrible value in X0. How is that even possible?
>>>>>
>>>>> This deserves to be investigated.
>>>>
>>>> I guess the problem is that the cpu tear-down function is called even while a kvm
>>>> guest is still running in kvm_arch_vcpu_ioctl_run().
>>>> So adding a check for whether the cpu has been initialized to every iteration of
>>>> kvm_arch_vcpu_ioctl_run() will, if necessary, terminate a guest safely without
>>>> entering guest mode. Since this check is done while interrupts are disabled, it
>>>> won't interfere with kvm_arch_hardware_disable() called via IPI.
>>>> See the attached fixup patch.
>>>>
>>>> Again, I verified the code on the model.
>>>>
>>>> Thanks,
>>>> -Takahiro AKASHI
>>>>
>>>>> Thanks,
>>>>>
>>>>> 	M.
>>>>>
>>>>
>>>> ----8<----
>>>>    From 77f273ba5e0c3dfcf75a5a8d1da8035cc390250c Mon Sep 17 00:00:00 2001
>>>> From: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>> Date: Fri, 11 Dec 2015 13:43:35 +0900
>>>> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
>>>>
>>>> ---
>>>>     arch/arm/kvm/arm.c |   45 ++++++++++++++++++++++++++++++++++-----------
>>>>     1 file changed, 34 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>>>> index 518c3c7..d7e86fb 100644
>>>> --- a/arch/arm/kvm/arm.c
>>>> +++ b/arch/arm/kvm/arm.c
>>>> @@ -573,7 +573,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>>>     		/*
>>>>     		 * Re-check atomic conditions
>>>>     		 */
>>>> -		if (signal_pending(current)) {
>>>> +		if (__hyp_get_vectors() == hyp_default_vectors) {
>>>> +			/* cpu has been torn down */
>>>> +			ret = -ENOEXEC;
>>>> +			run->exit_reason = KVM_EXIT_SHUTDOWN;
>>>
>>>
>>> That feels completely overkill (and very slow). Why don't you maintain a
>>> per-cpu variable containing the CPU states, which will avoid calling
>>> __hyp_get_vectors() all the time? You should be able to reuse that
>>> construct everywhere.
>>
>> OK. Since I have already introduced a per-cpu variable, kvm_arm_hardware_enabled,
>> to address the cpuidle issue, we will be able to re-use it.
>>
>>> Also, I'm not sure about KVM_EXIT_SHUTDOWN. This looks very x86 specific
>>> (called on triple fault).
>>
>> No, I don't think so.
>
> maz at approximate:~/Work/arm-platforms$ git grep KVM_EXIT_SHUTDOWN
> arch/x86/kvm/svm.c:     kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
> arch/x86/kvm/vmx.c:     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> arch/x86/kvm/x86.c:                     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> include/uapi/linux/kvm.h:#define KVM_EXIT_SHUTDOWN         8
>
> And that's it. No other architecture ever generates this, and this is
> an undocumented API. So I'm not going to let that in until someone actually
> defines what this thing means.
>
>> Looking at kvm_cpu_exec() in kvm-all.c of qemu, KVM_EXIT_SHUTDOWN
>> is handled in a generic way and results in a reset request.
>
> Which is not what we want. We want to indicate that the guest couldn't
> be entered. This is not due to a guest doing a triple fault (which is
> the way an x86 system gets rebooted).
>
>> On the other hand, KVM_EXIT_FAIL_ENTRY seems more arch-specific.
>
> Certainly arch specific, but actually extremely accurate. You couldn't
> enter the guest, and you describe the reason in an architecture-specific
> fashion. This is also the only exit code that describes this exact case
> we're talking about here.
>
>> In addition, if kvm_vcpu_ioctl() returns a negative value, run->exit_reason
>> will never be examined.
>> So I think
>>      ret -> 0
>>      run->exit_reason -> KVM_EXIT_SHUTDOWN
>
> ret = 0
> run->exit_reason = KVM_EXIT_FAIL_ENTRY;
> run->hardware_entry_failure_reason = (u64)-ENOEXEC;

OK.
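
For reference, here is a minimal sketch of how a userspace run loop might
tell this exit apart from a signal. It is only an illustration under my own
assumptions, not what qemu does: every sketch_/SKETCH_ name is a hypothetical
stand-in for the real definitions in <linux/kvm.h>, and 'ret' models the
kernel-side return value (a negative errno) rather than the -1/errno pair
that ioctl() reports.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the uapi exit codes in <linux/kvm.h>. */
#define SKETCH_EXIT_FAIL_ENTRY	9
#define SKETCH_EXIT_INTR	10

/* Hypothetical, trimmed-down view of the fields of struct kvm_run used here. */
struct sketch_run {
	uint32_t exit_reason;
	uint64_t hardware_entry_failure_reason;
};

enum sketch_action { SKETCH_CONTINUE, SKETCH_RETRY, SKETCH_STOP_VCPU };

/* The kernel side stores (u64)-ENOEXEC; read it back as a signed errno. */
static int64_t sketch_decode_reason(uint64_t reason)
{
	return (int64_t)reason;
}

/* Decide what to do after one KVM_RUN attempt. */
static enum sketch_action sketch_handle_exit(int ret,
					     const struct sketch_run *run)
{
	/* On a negative return, exit_reason is never examined. */
	if (ret < 0)
		return ret == -EINTR ? SKETCH_RETRY : SKETCH_STOP_VCPU;

	/* ret == 0: the guest could not be entered (cpu torn down). */
	if (run->exit_reason == SKETCH_EXIT_FAIL_ENTRY)
		return SKETCH_STOP_VCPU;

	return SKETCH_CONTINUE;
}
```

With ret == 0 and the fail-entry exit code, userspace sees the reason and
can stop the vcpu thread cleanly, while a pending signal still takes the
usual -EINTR retry path.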

>> or just
>>      ret -> -ENOEXEC
>> is the best.
>>
>> Either way, a guest will have no good chance to shut itself down gracefully
>> because we're kexec'ing (without waiting for threads' termination).
>
> Well, at least userspace gets a chance - and should kexec fail, we have
> a chance to recover.

Well, the current kexec implementation (on arm64) never fails
except at a very early stage :)

So please review the attached fixup patch, again.
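
To make the intended bookkeeping easier to follow, here is a small
standalone model of the per-cpu flag, assuming a fixed number of cpus and
plain arrays in place of real per-cpu variables. All sketch_ names are
hypothetical; only the flag semantics are meant to match the patch.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 4

/* Models kvm_arm_hardware_enabled: does KVM want EL2 set up on this cpu? */
static bool sketch_hw_enabled[SKETCH_NR_CPUS];
/* Models the actual EL2 state: are the KVM hyp vectors installed? */
static bool sketch_hyp_installed[SKETCH_NR_CPUS];

static void sketch_cpu_init_hyp(int cpu)  { sketch_hyp_installed[cpu] = true; }
static void sketch_cpu_reset_hyp(int cpu) { sketch_hyp_installed[cpu] = false; }

/* Idempotent: only the first call initializes EL2. */
static int sketch_hardware_enable(int cpu)
{
	if (!sketch_hw_enabled[cpu]) {
		sketch_cpu_init_hyp(cpu);
		sketch_hw_enabled[cpu] = true;
	}
	return 0;
}

/* Idempotent: a cpu that was never enabled is left alone. */
static void sketch_hardware_disable(int cpu)
{
	if (!sketch_hw_enabled[cpu])
		return;
	sketch_cpu_reset_hyp(cpu);
	sketch_hw_enabled[cpu] = false;
}

/* CPU_PM_ENTER: tear EL2 down but keep the flag, so EXIT can restore it. */
static void sketch_pm_enter(int cpu)
{
	if (sketch_hw_enabled[cpu])
		sketch_cpu_reset_hyp(cpu);
}

/* CPU_PM_EXIT: re-initialize only where KVM had been enabled. */
static void sketch_pm_exit(int cpu)
{
	if (sketch_hw_enabled[cpu])
		sketch_cpu_init_hyp(cpu);
}

/* The check done in the run loop before entering a guest. */
static bool sketch_can_enter_guest(int cpu)
{
	return sketch_hw_enabled[cpu];
}
```

The flag records what KVM asked for, not what EL2 currently holds, which is
why the PM notifier can tear down and restore without touching it.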

Thanks,
-Takahiro AKASHI


> Thanks,
>
> 	M.
>

----8<----
 From ec6c07fe80d6ba96855468f61daffa9b91cf5622 Mon Sep 17 00:00:00 2001
From: AKASHI Takahiro <takahiro.akashi at linaro.org>
Date: Fri, 11 Dec 2015 13:43:35 +0900
Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug

---
  arch/arm/kvm/arm.c |   62 +++++++++++++++++++++++++++++++++++-----------------
  1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 518c3c7..05eaa35 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -573,7 +573,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
  		/*
  		 * Re-check atomic conditions
  		 */
-		if (signal_pending(current)) {
+		if (!__this_cpu_read(kvm_arm_hardware_enabled)) {
+			/* cpu has been torn down */
+			ret = 0;
+			run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+			run->fail_entry.hardware_entry_failure_reason
+					= (u64)-ENOEXEC;
+		} else if (signal_pending(current)) {
  			ret = -EINTR;
  			run->exit_reason = KVM_EXIT_INTR;
  		}
@@ -950,7 +956,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
  	}
  }

-int kvm_arch_hardware_enable(void)
+static void cpu_init_hyp_mode(void)
  {
  	phys_addr_t boot_pgd_ptr;
  	phys_addr_t pgd_ptr;
@@ -958,9 +964,6 @@ int kvm_arch_hardware_enable(void)
  	unsigned long stack_page;
  	unsigned long vector_ptr;

-	if (__hyp_get_vectors() != hyp_default_vectors)
-		return 0;
-
  	/* Switch from the HYP stub to our own HYP init vector */
  	__hyp_set_vectors(kvm_get_idmap_vector());

@@ -973,24 +976,38 @@ int kvm_arch_hardware_enable(void)
  	__cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);

  	kvm_arm_init_debug();
-
-	return 0;
  }

-void kvm_arch_hardware_disable(void)
+static void cpu_reset_hyp_mode(void)
  {
  	phys_addr_t boot_pgd_ptr;
  	phys_addr_t phys_idmap_start;

-	if (__hyp_get_vectors() == hyp_default_vectors)
-		return;
-
  	boot_pgd_ptr = kvm_mmu_get_boot_httbr();
  	phys_idmap_start = kvm_get_idmap_start();

  	__cpu_reset_hyp_mode(boot_pgd_ptr, phys_idmap_start);
  }

+int kvm_arch_hardware_enable(void)
+{
+	if (!__this_cpu_read(kvm_arm_hardware_enabled)) {
+		cpu_init_hyp_mode();
+		__this_cpu_write(kvm_arm_hardware_enabled, 1);
+	}
+
+	return 0;
+}
+
+void kvm_arch_hardware_disable(void)
+{
+	if (!__this_cpu_read(kvm_arm_hardware_enabled))
+		return;
+
+	cpu_reset_hyp_mode();
+	__this_cpu_write(kvm_arm_hardware_enabled, 0);
+}
+
  #ifdef CONFIG_CPU_PM
  static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
  				    unsigned long cmd,
@@ -998,19 +1015,13 @@ static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
  {
  	switch (cmd) {
  	case CPU_PM_ENTER:
-		if (__hyp_get_vectors() != hyp_default_vectors)
-			__this_cpu_write(kvm_arm_hardware_enabled, 1);
-		else
-			__this_cpu_write(kvm_arm_hardware_enabled, 0);
-		/*
-		 * don't call kvm_arch_hardware_disable() in case of
-		 * CPU_PM_ENTER because it does't actually save any state.
-		 */
+		if (__this_cpu_read(kvm_arm_hardware_enabled))
+			cpu_reset_hyp_mode();

  		return NOTIFY_OK;
  	case CPU_PM_EXIT:
  		if (__this_cpu_read(kvm_arm_hardware_enabled))
-			kvm_arch_hardware_enable();
+			cpu_init_hyp_mode();

  		return NOTIFY_OK;

@@ -1114,9 +1125,20 @@ static int init_hyp_mode(void)
  	}

  	/*
+	 * Init this CPU temporarily to execute kvm_hyp_call()
+	 * during kvm_vgic_hyp_init().
+	 */
+	preempt_disable();
+	cpu_init_hyp_mode();
+
+	/*
  	 * Init HYP view of VGIC
  	 */
  	err = kvm_vgic_hyp_init();
+
+	cpu_reset_hyp_mode();
+	preempt_enable();
+
  	if (err)
  		goto out_free_context;

-- 
1.7.9.5