[PATCH v12 04/16] arm64: kvm: allows kvm cpu hotplug

AKASHI Takahiro takahiro.akashi at linaro.org
Mon Dec 14 23:51:29 PST 2015


On 12/15/2015 02:33 AM, Marc Zyngier wrote:
> On 14/12/15 07:33, AKASHI Takahiro wrote:
>> Marc,
>>
>> On 12/12/2015 01:28 AM, Marc Zyngier wrote:
>>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>>> Ashwin, Marc,
>>>>
>>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff at infradead.org> wrote:
>>>>>>> From: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>>>>>
>>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>>> at system boot, and has no way to gracefully shut down a core as far as
>>>>>>> kvm is concerned. In particular, this prevents kexec from rebooting the
>>>>>>> system on a boot core in EL2.
>>>>>>>
>>>>>>> This patch adds a cpu tear-down function and also moves the existing cpu-init
>>>>>>> code into a separate function; these become kvm_arch_hardware_disable() and
>>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>>> We no longer need an arm64-specific cpu hotplug hook.
>>>>>>>
>>>>>>> Since this patch modifies code that is common to arm and arm64, one
>>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on the arm side to avoid
>>>>>>> compile errors.
>>>>>>>
>>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>>>>> ---
>>>>>>>     arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>>     arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>>     arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>>     arch/arm/kvm/mmu.c                |  5 +++
>>>>>>>     arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>>     arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>>     arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>>     arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>>     arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>>     9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>     static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>>            }
>>>>>>>
>>>>>>>            /*
>>>>>>> -        * Execute the init code on each CPU.
>>>>>>> -        */
>>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>>> -
>>>>>>> -       /*
>>>>>>>             * Init HYP view of VGIC
>>>>>>>             */
>>>>>>>            err = kvm_vgic_hyp_init();
>>>>>>
>>>>>> With this flow, cpu_init_hyp_mode() is called only at VM guest
>>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>>> GICv3, it looks like we end up with bogus values from ICH_VTR_EL2
>>>>>> (used to get the number of LRs), because we're not reading it from EL2
>>>>>> anymore.
>>>>
>>>> Thank you for pointing this out.
>>>> I recently tested my kdump code on hikey, and since hikey (hi6220) has a GIC-400,
>>>> I didn't notice this problem.
>>>
>>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>>> GICv3 uses some system registers that are only available at EL2, and KVM
>>> needs some information contained in these registers before it can be
>>> initialized.
>>
>> I see.
>>
>>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>>> model). I wish this kind of details had been checked earlier, but thanks
>>>>> for pointing it out.
>>>>>
>>>>>> What's the best way to fix this?
>>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>>
>>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>>> while vgic_hyp_init() can only be called once. Also,
>>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>>> wouldn't feel comfortable starting to probe DT and allocating stuff from
>>>>> there.
>>>>
>>>> Do you think so?
>>>> How about the fixup! patch attached below?
>>>> The point is that, like Ashwin's first idea, we initialize the cpus temporarily
>>>> before kvm_vgic_hyp_init() and then reset them again right afterwards. Thus,
>>>> kvm cpu hotplug will still continue to work as before.
>>>> Since cpu_init_hyp_mode() is revived exactly as in Marc's original code,
>>>> the change will not be a big jump.
>>>
>>> This seems quite complicated:
>>> - init EL2 on all CPUs
>>> - do some initialization
>>> - tear all CPUs EL2 down
>>> - let KVM drive the vectors being set or not
>>>
>>> My questions are: why do we need to do this on *all* cpus? Can't that
>>> work on a single one?
>>
>> I did initialize all the cpus partly because using preempt_enable/disable
>> looked a bit ugly and partly because we may, in the future, do additional
>> per-cpu initialization in kvm_vgic_hyp_init() and/or kvm_timer_hyp_init().
>> But if you're comfortable with the preempt_*() stuff, I don't care.
>>
>>
>>> Also, the simple fact that we were able to get some junk value is a sign
>>> that something is amiss. I'd expect a splat of some sort, because we now
>>> have a possibility of doing things in the wrong context.
>>>
>>>>
>>>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>>>> then I hope this will work. I actually confirmed that, with this fixup! patch,
>>>> we can run a kvm guest and also successfully execute kexec on a model with GICv3.
>>>>
>>>> My only concern is the following kernel message I saw when kexec shut down
>>>> the kernel:
>>>> (Please note that I was running one kvm guest (pid=961) here.)
>>>>
>>>> ===
>>>> sh-4.3# ./kexec -d -e
>>>> kexec version: 15.11.16.11.06-g41e52e2
>>>> arch_process_options:112: command_line: (null)
>>>> arch_process_options:114: initrd: (null)
>>>> arch_process_options:115: dtb: (null)
>>>> arch_process_options:117: port: 0x0
>>>> kvm: exiting hardware virtualization
>>>> kvm [961]: Unsupported exception type: 6248304    <== this message
>>>
>>> That makes me feel very uncomfortable. It looks like we've exited a
>>> guest with some horrible value in X0. How is that even possible?
>>>
>>> This deserves to be investigated.
>>
>> I guess the problem is that the cpu tear-down function is called even while a kvm
>> guest is still running in kvm_arch_vcpu_ioctl_run().
>> So adding a check, in every iteration of kvm_arch_vcpu_ioctl_run(), for whether the
>> cpu has been initialized will, if necessary, terminate a guest safely without
>> entering guest mode. Since this check is done while interrupts are disabled, it
>> won't interfere with kvm_arch_hardware_disable() called via IPI.
>> See the attached fixup patch.
>>
>> Again, I verified the code on model.
>>
>> Thanks,
>> -Takahiro AKASHI
>>
>>> Thanks,
>>>
>>> 	M.
>>>
>>
>> ----8<----
>>   From 77f273ba5e0c3dfcf75a5a8d1da8035cc390250c Mon Sep 17 00:00:00 2001
>> From: AKASHI Takahiro <takahiro.akashi at linaro.org>
>> Date: Fri, 11 Dec 2015 13:43:35 +0900
>> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
>>
>> ---
>>    arch/arm/kvm/arm.c |   45 ++++++++++++++++++++++++++++++++++-----------
>>    1 file changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 518c3c7..d7e86fb 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -573,7 +573,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>    		/*
>>    		 * Re-check atomic conditions
>>    		 */
>> -		if (signal_pending(current)) {
>> +		if (__hyp_get_vectors() == hyp_default_vectors) {
>> +			/* cpu has been torn down */
>> +			ret = -ENOEXEC;
>> +			run->exit_reason = KVM_EXIT_SHUTDOWN;
>
>
> That feels completely overkill (and very slow). Why don't you maintain a
> per-cpu variable containing the CPU states, which will avoid calling
> __hyp_get_vectors() all the time? You should be able to reuse that
> construct everywhere.

OK. Since I have already introduced a per-cpu variable, kvm_arm_hardware_enabled,
for the cpuidle issue, we will be able to re-use it here.
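
Something along these lines, just as a sketch (the exact declaration and the
surrounding cleanup depend on how the cpuidle fix ends up defining the flag;
the names below are only placeholders for that):

	/*
	 * Sketch only, not the actual patch.  Assumes a per-cpu flag,
	 * kvm_arm_hardware_enabled, that kvm_arch_hardware_enable() sets
	 * and kvm_arch_hardware_disable() clears.
	 */
	static DEFINE_PER_CPU(int, kvm_arm_hardware_enabled);

	/*
	 * Then, in kvm_arch_vcpu_ioctl_run(), with interrupts already
	 * disabled, the re-check becomes a cheap per-cpu read instead of
	 * a __hyp_get_vectors() call on every iteration:
	 */
	if (!__this_cpu_read(kvm_arm_hardware_enabled)) {
		/* this CPU's EL2 state has already been torn down */
		ret = -ENOEXEC;
		break;
	}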

> Also, I'm not sure about KVM_EXIT_SHUTDOWN. This looks very x86 specific
> (called on triple fault).

No, I don't think so.
Looking at kvm_cpu_exec() in qemu's kvm-all.c, KVM_EXIT_SHUTDOWN
is handled in a generic way and results in a reset request.
On the other hand, KVM_EXIT_FAIL_ENTRY seems more arch-specific.
In addition, if kvm_vcpu_ioctl() returns a negative value, run->exit_reason
will never be examined.
So I think either
    ret -> 0
    run->exit_reason -> KVM_EXIT_SHUTDOWN
or just
    ret -> -ENOEXEC
is best.
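
Just to illustrate the point (this is only a simplified stand-in for qemu's
kvm_cpu_exec(), not its actual code): when the KVM_RUN ioctl comes back with a
negative value, userspace takes the error path before run->exit_reason is ever
looked at, so returning -ENOEXEC alone is enough to stop the vcpu thread:

	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int vcpu_exec(int vcpu_fd, struct kvm_run *run)
	{
		for (;;) {
			if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
				if (errno == EINTR || errno == EAGAIN)
					continue;
				/* error path: run->exit_reason is never read */
				return -errno;
			}

			switch (run->exit_reason) {
			case KVM_EXIT_SHUTDOWN:
				/* handled generically: request a system reset */
				return 0;
			default:
				/* MMIO and other exits handled here */
				break;
			}
		}
	}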

Either way, a guest will have no real chance to shut itself down gracefully,
because we're kexec'ing (without waiting for threads to terminate).

-Takahiro AKASHI

> KVM_EXIT_FAIL_ENTRY looks more appropriate,
> and the hardware_entry_failure_reason field should be populated (and
> documented).
>
> Thanks,
>
> 	M.
>


