[PATCH v3 14/20] KVM: arm/arm64: Avoid timer save/restore in vcpu entry/exit
Marc Zyngier
marc.zyngier at arm.com
Tue Oct 10 01:47:33 PDT 2017
On Sat, Sep 23 2017 at 2:42:01 am BST, Christoffer Dall <cdall at linaro.org> wrote:
> We don't need to save and restore the hardware timer state and examine
> if it generates interrupts on every entry/exit to the guest. The
> timer hardware is perfectly capable of telling us when it has expired
> by signaling interrupts.
>
> When taking a vtimer interrupt in the host, we don't want to mess with
> the timer configuration, we just want to forward the physical interrupt
> to the guest as a virtual interrupt. We can use the split priority drop
> and deactivate feature of the GIC to do this, which leaves an EOI'ed
> interrupt active on the physical distributor, making sure we don't keep
> taking timer interrupts which would prevent the guest from running. We
> can then forward the physical interrupt to the VM using the HW bit in
> the LR of the GIC VE, like we do already, which lets the guest directly
VE?
> deactivate both the physical and virtual timer simultaneously, allowing
> the timer hardware to exit the VM and generate a new physical interrupt
> when the timer output is again asserted later on.
>
> We do need to capture this state when migrating VCPUs between physical
> CPUs, however, which we use the vcpu put/load functions for, which are
> called through preempt notifiers whenever the thread is scheduled away
> from the CPU or called directly if we return from the ioctl to
> userspace.
>
> One caveat is that we cannot restore the timer state during
> kvm_timer_vcpu_load, because the flow of sleeping a VCPU is:
>
> 1. kvm_vcpu_block
> 2. kvm_timer_schedule
> 3. schedule
> 4. kvm_timer_vcpu_put (preempt notifier)
> 5. schedule (vcpu thread gets scheduled back)
> 6. kvm_timer_vcpu_load
> <---- We restore the hardware state here, but the bg_timer
> hrtimer may have scheduled a work function that also
> changes the timer state here.
> 7. kvm_timer_unschedule
> <---- We can restore the state here instead
>
> So, while we do need to restore the timer state in step (6) in all other
> cases than when we called kvm_vcpu_block(), we have to defer the restore
> to step (7) when coming back after kvm_vcpu_block(). Note that we
> cannot simply call cancel_work_sync() in step (6), because vcpu_load can
> be called from a preempt notifier.
>
> An added benefit beyond not having to read and write the timer sysregs
> on every entry and exit is that we no longer have to actively write the
> active state to the physical distributor, because we set the affinity of
I don't understand this thing about the affinity of the timer. It is a
PPI, so it cannot go anywhere else.
> the vtimer interrupt when loading the timer state, so that the interrupt
> automatically stays active after firing.
>
> Signed-off-by: Christoffer Dall <cdall at linaro.org>
> ---
> include/kvm/arm_arch_timer.h | 9 +-
> virt/kvm/arm/arch_timer.c | 238 +++++++++++++++++++++++++++----------------
> virt/kvm/arm/arm.c | 19 +++-
> virt/kvm/arm/hyp/timer-sr.c | 8 +-
> 4 files changed, 174 insertions(+), 100 deletions(-)
>
> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
> index 16887c0..8e5ed54 100644
> --- a/include/kvm/arm_arch_timer.h
> +++ b/include/kvm/arm_arch_timer.h
> @@ -31,8 +31,8 @@ struct arch_timer_context {
> /* Timer IRQ */
> struct kvm_irq_level irq;
>
> - /* Active IRQ state caching */
> - bool active_cleared_last;
> + /* Is the timer state loaded on the hardware timer */
> + bool loaded;
I think this little guy is pretty crucial to understand the flow, as
there are now two points where we save/restore the timer:
vcpu_load/vcpu_put and timer_schedule/timer_unschedule. Both can be
executed on the blocking path, and this is the predicate to find out if
there is actually something to do.
Would you mind adding a small comment to that effect?
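Something along these lines, maybe (wording entirely up to you):

```c
	/*
	 * We have multiple paths which can save/restore the timer
	 * state onto the hardware (vcpu_load/vcpu_put and
	 * timer_schedule/timer_unschedule), so track where the
	 * latest version of the state lives:
	 *
	 *   loaded == true:  State is loaded on the hardware registers.
	 *   loaded == false: State is stored in this structure.
	 */
	bool loaded;
```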
>
> /* Virtual offset */
> u64 cntvoff;
> @@ -80,10 +80,15 @@ void kvm_timer_unschedule(struct kvm_vcpu *vcpu);
>
> u64 kvm_phys_timer_read(void);
>
> +void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
> void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu);
>
> void kvm_timer_init_vhe(void);
>
> #define vcpu_vtimer(v) (&(v)->arch.timer_cpu.vtimer)
> #define vcpu_ptimer(v) (&(v)->arch.timer_cpu.ptimer)
> +
> +void enable_el1_phys_timer_access(void);
> +void disable_el1_phys_timer_access(void);
> +
> #endif
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 4275f8f..70110ea 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -46,10 +46,9 @@ static const struct kvm_irq_level default_vtimer_irq = {
> .level = 1,
> };
>
> -void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> -{
> - vcpu_vtimer(vcpu)->active_cleared_last = false;
> -}
> +static bool kvm_timer_irq_can_fire(struct arch_timer_context *timer_ctx);
> +static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
> + struct arch_timer_context *timer_ctx);
>
> u64 kvm_phys_timer_read(void)
> {
> @@ -74,17 +73,37 @@ static void soft_timer_cancel(struct hrtimer *hrt, struct work_struct *work)
> cancel_work_sync(work);
> }
>
> -static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
> +static void kvm_vtimer_update_mask_user(struct kvm_vcpu *vcpu)
> {
> - struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
> + struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>
> /*
> - * We disable the timer in the world switch and let it be
> - * handled by kvm_timer_sync_hwstate(). Getting a timer
> - * interrupt at this point is a sure sign of some major
> - * breakage.
> + * To prevent continuously exiting from the guest, we mask the
> + * physical interrupt when the virtual level is high, such that the
> + * guest can make forward progress. Once we detect the output level
> + * being deasserted, we unmask the interrupt again so that we exit
> + * from the guest when the timer fires.
Maybe an additional comment indicating that this only makes sense when
we don't have an in-kernel GIC? I know this wasn't in the original code,
but I started asking myself all kinds of questions until I realised what
this was for...
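Even something as simple as this would have saved me the head-scratching
(exact wording is just a suggestion):

```c
	/*
	 * Only used when the irqchip is emulated in userspace. With
	 * an in-kernel GIC, the HW bit in the LR takes care of
	 * masking/unmasking for us.
	 */
```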
> */
> - pr_warn("Unexpected interrupt %d on vcpu %p\n", irq, vcpu);
> + if (vtimer->irq.level)
> + disable_percpu_irq(host_vtimer_irq);
> + else
> + enable_percpu_irq(host_vtimer_irq, 0);
> +}
> +
> +static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
> +{
> + struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
> + struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> +
> + if (!vtimer->irq.level) {
> + vtimer->cnt_ctl = read_sysreg_el0(cntv_ctl);
> + if (kvm_timer_irq_can_fire(vtimer))
> + kvm_timer_update_irq(vcpu, true, vtimer);
> + }
> +
> + if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
> + kvm_vtimer_update_mask_user(vcpu);
> +
> return IRQ_HANDLED;
> }
>
> @@ -220,7 +239,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
> {
> int ret;
>
> - timer_ctx->active_cleared_last = false;
> timer_ctx->irq.level = new_level;
> trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_ctx->irq.irq,
> timer_ctx->irq.level);
> @@ -276,10 +294,16 @@ static void phys_timer_emulate(struct kvm_vcpu *vcpu,
> soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(timer_ctx));
> }
>
> -static void timer_save_state(struct kvm_vcpu *vcpu)
> +static void vtimer_save_state(struct kvm_vcpu *vcpu)
> {
> struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> + unsigned long flags;
> +
> + local_irq_save(flags);
Is that to avoid racing against the timer when doing a
vcpu_put/timer/schedule?
> +
> + if (!vtimer->loaded)
> + goto out;
>
> if (timer->enabled) {
> vtimer->cnt_ctl = read_sysreg_el0(cntv_ctl);
> @@ -288,6 +312,10 @@ static void timer_save_state(struct kvm_vcpu *vcpu)
>
> /* Disable the virtual timer */
> write_sysreg_el0(0, cntv_ctl);
> +
> + vtimer->loaded = false;
> +out:
> + local_irq_restore(flags);
> }
>
> /*
> @@ -303,6 +331,8 @@ void kvm_timer_schedule(struct kvm_vcpu *vcpu)
>
> BUG_ON(bg_timer_is_armed(timer));
>
> + vtimer_save_state(vcpu);
> +
> /*
> * No need to schedule a background timer if any guest timer has
> * already expired, because kvm_vcpu_block will return before putting
> @@ -326,16 +356,26 @@ void kvm_timer_schedule(struct kvm_vcpu *vcpu)
> soft_timer_start(&timer->bg_timer, kvm_timer_earliest_exp(vcpu));
> }
>
> -static void timer_restore_state(struct kvm_vcpu *vcpu)
> +static void vtimer_restore_state(struct kvm_vcpu *vcpu)
> {
> struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> + unsigned long flags;
> +
> + local_irq_save(flags);
> +
> + if (vtimer->loaded)
> + goto out;
>
> if (timer->enabled) {
> write_sysreg_el0(vtimer->cnt_cval, cntv_cval);
> isb();
> write_sysreg_el0(vtimer->cnt_ctl, cntv_ctl);
> }
> +
> + vtimer->loaded = true;
> +out:
> + local_irq_restore(flags);
> }
>
> void kvm_timer_unschedule(struct kvm_vcpu *vcpu)
> @@ -344,6 +384,8 @@ void kvm_timer_unschedule(struct kvm_vcpu *vcpu)
>
> soft_timer_cancel(&timer->bg_timer, &timer->expired);
> timer->armed = false;
> +
> + vtimer_restore_state(vcpu);
> }
>
> static void set_cntvoff(u64 cntvoff)
> @@ -353,61 +395,56 @@ static void set_cntvoff(u64 cntvoff)
> kvm_call_hyp(__kvm_timer_set_cntvoff, low, high);
> }
>
> -static void kvm_timer_flush_hwstate_vgic(struct kvm_vcpu *vcpu)
> +static void kvm_timer_vcpu_load_vgic(struct kvm_vcpu *vcpu)
> {
> struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> bool phys_active;
> int ret;
>
> - /*
> - * If we enter the guest with the virtual input level to the VGIC
> - * asserted, then we have already told the VGIC what we need to, and
> - * we don't need to exit from the guest until the guest deactivates
> - * the already injected interrupt, so therefore we should set the
> - * hardware active state to prevent unnecessary exits from the guest.
> - *
> - * Also, if we enter the guest with the virtual timer interrupt active,
> - * then it must be active on the physical distributor, because we set
> - * the HW bit and the guest must be able to deactivate the virtual and
> - * physical interrupt at the same time.
> - *
> - * Conversely, if the virtual input level is deasserted and the virtual
> - * interrupt is not active, then always clear the hardware active state
> - * to ensure that hardware interrupts from the timer triggers a guest
> - * exit.
> - */
> - phys_active = vtimer->irq.level ||
> - kvm_vgic_map_is_active(vcpu, vtimer->irq.irq);
> -
> - /*
> - * We want to avoid hitting the (re)distributor as much as
> - * possible, as this is a potentially expensive MMIO access
> - * (not to mention locks in the irq layer), and a solution for
> - * this is to cache the "active" state in memory.
> - *
> - * Things to consider: we cannot cache an "active set" state,
> - * because the HW can change this behind our back (it becomes
> - * "clear" in the HW). We must then restrict the caching to
> - * the "clear" state.
> - *
> - * The cache is invalidated on:
> - * - vcpu put, indicating that the HW cannot be trusted to be
> - * in a sane state on the next vcpu load,
> - * - any change in the interrupt state
> - *
> - * Usage conditions:
> - * - cached value is "active clear"
> - * - value to be programmed is "active clear"
> - */
> - if (vtimer->active_cleared_last && !phys_active)
> - return;
> -
> + if (vtimer->irq.level || kvm_vgic_map_is_active(vcpu, vtimer->irq.irq))
> + phys_active = true;
> + else
> + phys_active = false;
nit: this can be written as:
phys_active = (vtimer->irq.level ||
kvm_vgic_map_is_active(vcpu, vtimer->irq.irq));
Not that it matters in the slightest...
> ret = irq_set_irqchip_state(host_vtimer_irq,
> IRQCHIP_STATE_ACTIVE,
> phys_active);
> WARN_ON(ret);
> +}
> +
> +static void kvm_timer_vcpu_load_user(struct kvm_vcpu *vcpu)
> +{
> + kvm_vtimer_update_mask_user(vcpu);
> +}
> +
> +void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
> +{
> + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> + struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> +
> + if (unlikely(!timer->enabled))
> + return;
> +
> + if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
> + kvm_timer_vcpu_load_user(vcpu);
> + else
> + kvm_timer_vcpu_load_vgic(vcpu);
>
> - vtimer->active_cleared_last = !phys_active;
> + set_cntvoff(vtimer->cntvoff);
> +
> + /*
> + * If we armed a soft timer and potentially queued work, we have to
> + * cancel this, but cannot do it here, because canceling work can
> + * sleep and we can be in the middle of a preempt notifier call.
> + * Instead, when the timer has been armed, we know the return path
> + * from kvm_vcpu_block will call kvm_timer_unschedule, so we can defer
> + * restoring the state and canceling any soft timers and work items
> + * until then.
> + */
> + if (!bg_timer_is_armed(timer))
> + vtimer_restore_state(vcpu);
> +
> + if (has_vhe())
> + disable_el1_phys_timer_access();
> }
>
> bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
> @@ -427,23 +464,6 @@ bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
> ptimer->irq.level != plevel;
> }
>
> -static void kvm_timer_flush_hwstate_user(struct kvm_vcpu *vcpu)
> -{
> - struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> -
> - /*
> - * To prevent continuously exiting from the guest, we mask the
> - * physical interrupt such that the guest can make forward progress.
> - * Once we detect the output level being deasserted, we unmask the
> - * interrupt again so that we exit from the guest when the timer
> - * fires.
> - */
> - if (vtimer->irq.level)
> - disable_percpu_irq(host_vtimer_irq);
> - else
> - enable_percpu_irq(host_vtimer_irq, 0);
> -}
> -
> /**
> * kvm_timer_flush_hwstate - prepare timers before running the vcpu
> * @vcpu: The vcpu pointer
> @@ -456,23 +476,55 @@ static void kvm_timer_flush_hwstate_user(struct kvm_vcpu *vcpu)
> void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
> {
> struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> - struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> + struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
>
> if (unlikely(!timer->enabled))
> return;
>
> - kvm_timer_update_state(vcpu);
> + if (kvm_timer_should_fire(ptimer) != ptimer->irq.level)
> + kvm_timer_update_irq(vcpu, !ptimer->irq.level, ptimer);
>
> /* Set the background timer for the physical timer emulation. */
> phys_timer_emulate(vcpu, vcpu_ptimer(vcpu));
> +}
>
> - if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
> - kvm_timer_flush_hwstate_user(vcpu);
> - else
> - kvm_timer_flush_hwstate_vgic(vcpu);
> +void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
>
> - set_cntvoff(vtimer->cntvoff);
> - timer_restore_state(vcpu);
> + if (unlikely(!timer->enabled))
> + return;
> +
> + if (has_vhe())
> + enable_el1_phys_timer_access();
> +
> + vtimer_save_state(vcpu);
> +
> + set_cntvoff(0);
Can this be moved into vtimer_save_state()? And thinking of it, why
don't we reset cntvoff in kvm_timer_schedule() as well?
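i.e. something like this (untested, and assuming nothing else in the
put path depends on cntvoff still being set at that point):

```c
static void vtimer_save_state(struct kvm_vcpu *vcpu)
{
	...
	/* Disable the virtual timer */
	write_sysreg_el0(0, cntv_ctl);
	set_cntvoff(0);

	vtimer->loaded = false;
out:
	local_irq_restore(flags);
}
```

which would also cover the kvm_timer_schedule() path for free.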
> +}
> +
> +static void unmask_vtimer_irq(struct kvm_vcpu *vcpu)
> +{
> + struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> +
> + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) {
> + kvm_vtimer_update_mask_user(vcpu);
> + return;
> + }
> +
> + /*
> + * If the guest disabled the timer without acking the interrupt, then
> + * we must make sure the physical and virtual active states are in
> + * sync by deactivating the physical interrupt, because otherwise we
> + * wouldn't see the next timer interrupt in the host.
> + */
> + if (!kvm_vgic_map_is_active(vcpu, vtimer->irq.irq)) {
> + int ret;
> + ret = irq_set_irqchip_state(host_vtimer_irq,
> + IRQCHIP_STATE_ACTIVE,
> + false);
> + WARN_ON(ret);
> + }
> }
>
> /**
> @@ -485,6 +537,7 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
> void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu)
> {
> struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> + struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>
> /*
> * This is to cancel the background timer for the physical timer
> @@ -492,14 +545,19 @@ void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu)
> */
> soft_timer_cancel(&timer->phys_timer, NULL);
>
> - timer_save_state(vcpu);
> - set_cntvoff(0);
> -
> /*
> - * The guest could have modified the timer registers or the timer
> - * could have expired, update the timer state.
> + * If we entered the guest with the vtimer output asserted we have to
> + * check if the guest has modified the timer so that we should lower
> + * the line at this point.
> */
> - kvm_timer_update_state(vcpu);
> + if (vtimer->irq.level) {
> + vtimer->cnt_ctl = read_sysreg_el0(cntv_ctl);
> + vtimer->cnt_cval = read_sysreg_el0(cntv_cval);
> + if (!kvm_timer_should_fire(vtimer)) {
> + kvm_timer_update_irq(vcpu, false, vtimer);
> + unmask_vtimer_irq(vcpu);
> + }
> + }
> }
>
> int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu)
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 27db222..132d39a 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -354,18 +354,18 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
>
> kvm_arm_set_running_vcpu(vcpu);
> -
> kvm_vgic_load(vcpu);
> + kvm_timer_vcpu_load(vcpu);
> }
>
> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> + kvm_timer_vcpu_put(vcpu);
> kvm_vgic_put(vcpu);
>
> vcpu->cpu = -1;
>
> kvm_arm_set_running_vcpu(NULL);
> - kvm_timer_vcpu_put(vcpu);
> }
>
> static void vcpu_power_off(struct kvm_vcpu *vcpu)
> @@ -710,16 +710,27 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> kvm_arm_clear_debug(vcpu);
>
> /*
> - * We must sync the PMU and timer state before the vgic state so
> + * We must sync the PMU state before the vgic state so
> * that the vgic can properly sample the updated state of the
> * interrupt line.
> */
> kvm_pmu_sync_hwstate(vcpu);
> - kvm_timer_sync_hwstate(vcpu);
>
> + /*
> + * Sync the vgic state before syncing the timer state because
> + * the timer code needs to know if the virtual timer
> + * interrupts are active.
> + */
> kvm_vgic_sync_hwstate(vcpu);
>
> /*
> + * Sync the timer hardware state before enabling interrupts as
> + * we don't want vtimer interrupts to race with syncing the
> + * timer virtual interrupt state.
> + */
> + kvm_timer_sync_hwstate(vcpu);
> +
> + /*
> * We may have taken a host interrupt in HYP mode (ie
> * while executing the guest). This interrupt is still
> * pending, as we haven't serviced it yet!
> diff --git a/virt/kvm/arm/hyp/timer-sr.c b/virt/kvm/arm/hyp/timer-sr.c
> index a6c3b10..f398616 100644
> --- a/virt/kvm/arm/hyp/timer-sr.c
> +++ b/virt/kvm/arm/hyp/timer-sr.c
> @@ -27,7 +27,7 @@ void __hyp_text __kvm_timer_set_cntvoff(u32 cntvoff_low, u32 cntvoff_high)
> write_sysreg(cntvoff, cntvoff_el2);
> }
>
> -void __hyp_text enable_phys_timer(void)
> +void __hyp_text enable_el1_phys_timer_access(void)
> {
> u64 val;
>
> @@ -37,7 +37,7 @@ void __hyp_text enable_phys_timer(void)
> write_sysreg(val, cnthctl_el2);
> }
>
> -void __hyp_text disable_phys_timer(void)
> +void __hyp_text disable_el1_phys_timer_access(void)
> {
> u64 val;
>
> @@ -58,11 +58,11 @@ void __hyp_text __timer_disable_traps(struct kvm_vcpu *vcpu)
> * with HCR_EL2.TGE ==1, which makes those bits have no impact.
> */
> if (!has_vhe())
> - enable_phys_timer();
> + enable_el1_phys_timer_access();
> }
>
> void __hyp_text __timer_enable_traps(struct kvm_vcpu *vcpu)
> {
> if (!has_vhe())
> - disable_phys_timer();
> + disable_el1_phys_timer_access();
> }
It'd be nice to move this renaming to the patch that introduces these two
functions.
Thanks,
M.
--
Jazz is not dead, it just smell funny.