[PATCH v3 00/41] Optimize KVM/ARM for VHE systems

Christoffer Dall christoffer.dall at linaro.org
Wed Jan 17 02:48:43 PST 2018


On Wed, Jan 17, 2018 at 11:34:54AM +0300, Yury Norov wrote:
> On Mon, Jan 15, 2018 at 04:50:36PM +0100, Christoffer Dall wrote:
> > Hi Yury,
> > 
> > On Mon, Jan 15, 2018 at 05:14:23PM +0300, Yury Norov wrote:
> > > On Fri, Jan 12, 2018 at 01:07:06PM +0100, Christoffer Dall wrote:
> > > > This series redesigns parts of KVM/ARM to optimize the performance on
> > > > VHE systems.  The general approach is to try to do as little work as
> > > > possible when transitioning between the VM and the hypervisor.  This has
> > > > the benefit of lower latency when waiting for interrupts and delivering
> > > > virtual interrupts, and reduces the overhead of emulating behavior and
> > > > I/O in the host kernel.
> > > > 
> > > > Patches 01 through 06 are not VHE specific, but rework parts of KVM/ARM
> > > > that can be generally improved.  We then add infrastructure to move more
> > > > logic into vcpu_load and vcpu_put, and we improve the handling of VFP
> > > > and debug registers.
> > > > 
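(Structurally, this work hangs off KVM's existing load/put callbacks; the
sketch below is simplified and not the exact code in the patches:)

#include <linux/kvm_host.h>

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	/*
	 * Restore guest state that the host won't touch while this vcpu
	 * is scheduled in on this physical CPU (e.g. VFP and debug
	 * state), instead of doing it on every single VM entry.
	 */
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	/*
	 * Save that state again only when the vcpu is actually scheduled
	 * out, not on every VM exit.
	 */
}
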
> > > > We then introduce a new world-switch function for VHE systems, which we
> > > > can tweak and optimize specifically for that configuration.  To do that,
> > > > we rework a lot of the system register save/restore handling and the
> > > > emulation code that may need access to system registers, so that we can
> > > > defer as many system register save/restore operations as possible to
> > > > vcpu_load and vcpu_put, and move this logic out of the VHE world-switch
> > > > function.
> > > > 
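(The consequence for the emulation code, roughly: on VHE a system register's
latest value may still live in the CPU between vcpu_load and vcpu_put, so
accesses have to go through helpers along these lines; the names below are
illustrative, not the exact ones in the series:)

#include <linux/kvm_host.h>

static u64 vcpu_read_sys_reg_sketch(struct kvm_vcpu *vcpu, int reg)
{
	/* Hypothetical flag: set in vcpu_load, cleared in vcpu_put */
	if (vcpu->arch.sysregs_loaded_on_cpu)
		return read_live_sysreg(reg);	/* hypothetical helper: read the hardware copy */

	return vcpu->arch.ctxt.sys_regs[reg];	/* read the in-memory copy */
}
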
> > > > We then optimize the configuration of traps.  On non-VHE systems, both
> > > > the host and VM kernels run in EL1, but because the host kernel should
> > > > have full access to the underlying hardware while the VM kernel should
> > > > not, we essentially make the host kernel more privileged than the VM
> > > > kernel, despite both running at the same exception level, by enabling
> > > > traps to EL2 when entering the VM and disabling those traps when
> > > > exiting the VM.  On VHE systems, the host kernel runs in EL2 and has
> > > > full access to the hardware (as much as allowed by secure side
> > > > software), and is unaffected by the trap configuration.  That means we
> > > > can configure the traps for VMs running in EL1 once, and don't have to
> > > > switch them on and off for every entry/exit to/from the VM.
> > > > 
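(In code terms, roughly: the non-VHE world switch has to flip the trap
configuration on every entry and exit, whereas on VHE the same writes can
happen once in vcpu_load/vcpu_put.  Simplified sketch only; the real flag
values differ:)

#include <linux/kvm_host.h>
#include <asm/kvm_arm.h>
#include <asm/sysreg.h>

/* non-VHE: runs on every world switch */
static void activate_traps_sketch(struct kvm_vcpu *vcpu)
{
	write_sysreg(vcpu->arch.hcr_el2, hcr_el2);	/* trap the guest to EL2 */
}

static void deactivate_traps_sketch(void)
{
	write_sysreg(HCR_RW, hcr_el2);	/* illustrative host value; the real flags differ */
}

/*
 * VHE: the host runs in EL2 and is unaffected by these EL1 traps, so the
 * activate step can move to vcpu_load and the deactivate step to vcpu_put,
 * leaving the world-switch path itself alone.
 */
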
> > > > Finally, we improve our VGIC handling by moving all save/restore logic
> > > > out of the VHE world-switch.  When the AP list is empty, we truly only
> > > > evaluate that fact and do no other VGIC work at all, and when virtual
> > > > interrupts are in flight, we only do the minimal amount of work
> > > > required for the VGIC processing.
> > > > 
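(The hoped-for fast path is essentially this one check and nothing else when
no virtual interrupts are in flight; sketch only:)

#include <linux/list.h>
#include <linux/kvm_host.h>

static bool vgic_nothing_in_flight(struct kvm_vcpu *vcpu)
{
	/*
	 * The AP (active/pending) list holds the virtual interrupts the
	 * guest currently has in flight; if it is empty, the flush/sync
	 * paths can return immediately without touching the GIC hardware.
	 */
	return list_empty(&vcpu->arch.vgic_cpu.ap_list_head);
}
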
> > > > The patches are based on v4.15-rc3, v9 of the level-triggered mapped
> > > > interrupts support series [1], and the first five patches of James' SDEI
> > > > series [2].
> > > > 
> > > > I've given the patches a fair amount of non-VHE testing on Thunder-X,
> > > > Mustang, Seattle, and TC2 (32-bit), and tested VHE functionality on
> > > > the Foundation model, running both 64-bit VMs and 32-bit VMs
> > > > side-by-side and using both GICv3-on-GICv3 and GICv2-on-GICv3.
> > > > 
> > > > The patches are also available in the vhe-optimize-v3 branch on my
> > > > kernel.org repository [3].  The vhe-optimize-v3-base branch contains
> > > > prerequisites of this series.
> > > > 
> > > > Changes since v2:
> > > >  - Rebased on v4.15-rc3.
> > > >  - Includes two additional patches that only do vcpu_load after
> > > >    kvm_vcpu_first_run_init and only for KVM_RUN.
> > > >  - Addressed review comments from v2 (detailed changelogs are in the
> > > >    individual patches).
> > > > 
> > > > Thanks,
> > > > -Christoffer
> > > > 
> > > > [1]: git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git level-mapped-v9
> > > > [2]: git://linux-arm.org/linux-jm.git sdei/v5/base
> > > > [3]: git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git vhe-optimize-v3
> > > 
> > > I tested this v3 series on ThunderX2 with the IPI benchmark:
> > > https://lkml.org/lkml/2017/12/11/364
> > > 
> > > I tried to address your comments from the v2 discussion, like pinning
> > > the module to a specific CPU (with taskset), increasing the number of
> > > iterations, and tuning the governor to max performance.  The results
> > > didn't change much, and are pretty stable.
> > 
> > Thanks for testing this.
> > > 
> > > Compared to the vanilla guest, normal IPI delivery for v3 is 20% slower.
> > > For v2 it was 27% slower, and for v1 it was 42% faster.  Interestingly,
> > > the acknowledge time is much faster for v3, so the overall time to
> > > deliver and acknowledge an IPI (2nd column) is less than on the vanilla
> > > 4.15-rc3 kernel.
> > 
> > I don't see this in your results.  It looks like the IPI cost increases
> > from 289 to 347?
> 
> I mean turnaround time - 497 without your patches and 490 with them.
> 

I have a hard time making sense of this.  It would indicate that something
which used to be slow in your IPI workload (either a loop that shouldn't
trap, or ktime_get() before it actually returns the time) has now become
faster, and much faster at that, given the increase in the send/receive IPI
time, or that the timers have become messed up and report inconsistent time
counts across the sending and receiving CPUs.  Hmm.

> > Also, acknowledging the IPI should be a constant cost (handled directly
> > by hardware), so that's definitely an indication something is wrong.
> > 
> > > 
> > > The test setup is unchanged since v2: ThunderX2, 112 online CPUs, the
> > > guest running under qemu-kvm, emulating GIC version 3.
> > > 
> > > Below are the test results for v1-v3, normalized to the host vanilla
> > > kernel dry-run time.
> > 
> > There must be some bug in this series, but I'm unsure where it is, as I
> > cannot observe it on the hardware I have at hand.
> > 
> > Perhaps we mistakenly enable GICv3 CPU interface trapping with this
> > series, or there is some other flow around the GIC which is broken.
> > 
> > It would be interesting if you could measure the base cost of an exit
> > from the VM to the hypervisor, using the cycle counter, on both
> > platforms.  That does require changing the host kernel to clear
> > MDCR_EL2.TPM when running a guest (unsafe), ensuring the cycle counter
> > runs across EL2/1/0 (for example by running KVM under perf), and
> > running a micro test that exits using a hypercall that does nothing
> > (like getting the PSCI version).
> 
> 
> I can do this later this week, OK?
> 

That would be helpful indeed.
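
Something like the following, run from a small module inside the guest, is
roughly what I have in mind.  This is only a sketch: it assumes the host
has MDCR_EL2.TPM cleared so the PMCCNTR_EL0 reads don't trap, and that the
cycle counter is enabled and counting across EL2/1/0.

#include <linux/types.h>
#include <linux/arm-smccc.h>
#include <uapi/linux/psci.h>
#include <asm/barrier.h>
#include <asm/sysreg.h>

static inline u64 read_cycles(void)
{
	isb();				/* order against surrounding code */
	return read_sysreg(pmccntr_el0);
}

static u64 measure_exit_cost(unsigned int iters)
{
	struct arm_smccc_res res;
	u64 start, end;
	unsigned int i;

	start = read_cycles();
	for (i = 0; i < iters; i++)
		/* PSCI_VERSION is a do-nothing round trip to the hypervisor */
		arm_smccc_hvc(PSCI_0_2_FN_PSCI_VERSION,
			      0, 0, 0, 0, 0, 0, 0, &res);
	end = read_cycles();

	return (end - start) / iters;	/* cycles per exit round trip */
}

Comparing what this reports on ThunderX2 with and without the series, and
against another platform, should tell us whether the base exit cost itself
regressed or whether the difference comes from somewhere else.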

Thanks,
-Christoffer


