[PATCH v2 00/36] Optimize KVM/ARM for VHE systems

Christoffer Dall christoffer.dall at linaro.org
Mon Dec 11 07:34:58 PST 2017


Hi Yury,

On Mon, Dec 11, 2017 at 05:43:23PM +0300, Yury Norov wrote:
> 
> On Thu, Dec 07, 2017 at 06:05:54PM +0100, Christoffer Dall wrote:
> > This series redesigns parts of KVM/ARM to optimize the performance on
> > VHE systems.  The general approach is to try to do as little work as
> > possible when transitioning between the VM and the hypervisor.  This has
> > the benefit of lower latency when waiting for interrupts and delivering
> > virtual interrupts, and reduces the overhead of emulating behavior and
> > I/O in the host kernel.
> > 
> > Patches 01 through 04 are not VHE specific, but rework parts of KVM/ARM
> > that can be generally improved.  We then add infrastructure to move more
> > logic into vcpu_load and vcpu_put, and we improve the handling of VFP
> > and debug registers.
> > 
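> > As a rough, self-contained sketch of the direction (illustrative only;
> > the structure and names below are made up for the example and are not
> > the actual patch code):
> > 
> >     /* Toy userspace model of deferring state switching to load/put. */
> >     #include <string.h>
> > 
> >     #define NR_REGS 4
> > 
> >     /* Stand-in for the CPU's register state. */
> >     static unsigned long hw_regs[NR_REGS];
> > 
> >     struct vcpu_ctx {
> >             unsigned long regs[NR_REGS];
> >     };
> > 
> >     /* Runs once when the vcpu thread enters KVM_RUN or is scheduled in,
> >      * not on every guest entry/exit. */
> >     static void example_vcpu_load(struct vcpu_ctx *vcpu)
> >     {
> >             memcpy(hw_regs, vcpu->regs, sizeof(hw_regs));
> >     }
> > 
> >     /* Runs once when the vcpu thread leaves KVM_RUN or is preempted. */
> >     static void example_vcpu_put(struct vcpu_ctx *vcpu)
> >     {
> >             memcpy(vcpu->regs, hw_regs, sizeof(hw_regs));
> >     }
> > 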
> > We then introduce a new world-switch function for VHE systems, which we
> > can tweak and optimize independently.  To do that, we rework a lot of
> > the system register save/restore handling and the emulation code that
> > may need access to system registers, so that we can defer as many system
> > register save/restore operations as possible to vcpu_load and vcpu_put,
> > and move this logic out of the VHE world-switch function.
> > 
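> > Continuing the toy model above (again purely illustrative; "vhe" below
> > stands in for the kernel's has_vhe() check, and the run functions are
> > made up for the example), the point is to give VHE its own run function
> > that can rely on the deferred state handling:
> > 
> >     #include <stdbool.h>
> > 
> >     static bool vhe;        /* stand-in for has_vhe() */
> > 
> >     static int run_guest_vhe(void)
> >     {
> >             /* Host already runs in EL2: no hyp call needed, and most
> >              * EL1 system registers were already loaded at vcpu_load
> >              * time, so only state the guest can clobber is handled
> >              * here. */
> >             return 0;
> >     }
> > 
> >     static int run_guest_nvhe(void)
> >     {
> >             /* Trap to EL2 and do the full save/restore on every
> >              * round trip. */
> >             return 0;
> >     }
> > 
> >     static int run_guest(void)
> >     {
> >             return vhe ? run_guest_vhe() : run_guest_nvhe();
> >     }
> > 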
> > We then optimize the configuration of traps.  On non-VHE systems, both
> > the host and VM kernels run in EL1, but because the host kernel should
> > have full access to the underlying hardware while the VM kernel should
> > not, we essentially make the host kernel more privileged than the VM
> > kernel, despite both running at the same privilege level, by enabling
> > traps to EL2 when entering the VM and disabling those traps when exiting
> > the VM.  On VHE systems, the host kernel runs in EL2 and has full access
> > to the hardware (as much as allowed by secure side software), and is
> > unaffected by the trap configuration.  That means we can configure the
> > traps for VMs running in EL1 once, and don't have to switch them on and
> > off for every entry/exit to/from the VM.
> > 
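> > A sketch of that difference (a simplified model; the flag names are
> > invented for the example and are not real HCR_EL2 bits):
> > 
> >     /* Toy model of trap configuration on non-VHE vs. VHE. */
> >     static unsigned long guest_trap_flags;
> > 
> >     #define TRAP_WFX        (1UL << 0)
> >     #define TRAP_FP         (1UL << 1)
> > 
> >     /* non-VHE: the host also runs in EL1 and would be hit by the same
> >      * traps, so they must be switched on at every guest entry and back
> >      * off at every exit. */
> >     static void nvhe_enter_guest(void)
> >     {
> >             guest_trap_flags |= TRAP_WFX | TRAP_FP;
> >     }
> > 
> >     static void nvhe_exit_guest(void)
> >     {
> >             guest_trap_flags &= ~(TRAP_WFX | TRAP_FP);
> >     }
> > 
> >     /* VHE: the host runs in EL2 and is unaffected by EL1 trap
> >      * settings, so the guest's traps can be programmed once, e.g. at
> >      * vcpu_load time. */
> >     static void vhe_vcpu_load(void)
> >     {
> >             guest_trap_flags = TRAP_WFX | TRAP_FP;
> >     }
> > 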
> > Finally, we improve our VGIC handling by moving all save/restore logic
> > out of the VHE world-switch, and we make it possible to simply check
> > whether the AP list is empty and not do *any* VGIC work if that is the
> > case, doing only the minimal amount of work required for VGIC
> > processing when we have virtual interrupts in flight.
> > 
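> > The fast path amounts to something like this (again a simplified,
> > illustrative sketch, not the actual VGIC code):
> > 
> >     #include <stddef.h>
> > 
> >     struct virq {
> >             struct virq *next;
> >     };
> > 
> >     struct example_vgic_cpu {
> >             struct virq *ap_list;   /* virtual interrupts in flight */
> >     };
> > 
> >     static void example_vgic_flush(struct example_vgic_cpu *vgic)
> >     {
> >             /* Nothing in flight: skip *all* VGIC register accesses. */
> >             if (vgic->ap_list == NULL)
> >                     return;
> > 
> >             /* Otherwise, do only the work needed to present the
> >              * pending interrupts to the guest (populate list
> >              * registers, etc.). */
> >     }
> > 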
> > The patches are based on v4.15-rc1 plus the fixes sent for v4.15-rc3
> > [1], the level-triggered mapped interrupts support series [2], the
> > first five patches of James' SDEI series [3], a single SVE patch that
> > moves the CPU ID reg trap setup out of the world-switch path, and v3 of
> > my vcpu load/put series [4].
> > 
> > I've given the patches a fair amount of testing on Thunder-X, Mustang,
> > Seattle, and TC2 (32-bit) for non-VHE testing, and tested VHE
> > functionality on the Foundation model, running both 64-bit VMs and
> > 32-bit VMs side-by-side and using both GICv3-on-GICv3 and
> > GICv2-on-GICv3.
> > 
> > The patches are also available in the vhe-optimize-v2 branch on my
> > kernel.org repository [5].
> > 
> > Changes since v1:
> >  - Rebased on v4.15-rc1 and newer versions of other dependencies,
> >    including the vcpu load/put approach taken for KVM.
> >  - Addressed review comments from v1 (detailed changelogs are in the
> >    individual patches).
> > 
> > Thanks,
> > -Christoffer
> > 
> > [1]: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm kvm-arm-fixes-for-v4.15-1
> > [2]: git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git level-mapped-v6
> > [3]: git://linux-arm.org/linux-jm.git sdei/v5/base
> > [4]: git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git vcpu-load-put-v3
> > [5]: git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git vhe-optimize-v2
> 
> I just submitted the benchmark I used to test your v1 and v2 series':
> https://lkml.org/lkml/2017/12/11/364
> 
> On ThunderX2 with 112 online CPUs, the test results for v1 are as follows:
> 
> Host, v4.14:
> Dry-run:          0         1
> Self-IPI:         9        18
> Normal IPI:      81       110
> Broadcast IPI:    0      2106
> 
> Guest, v4.14:
> Dry-run:          0         1
> Self-IPI:        10        18
> Normal IPI:     305       525
> Broadcast IPI:    0      9729
> 
> Guest, v4.14 + VHE:
> Dry-run:          0         1
> Self-IPI:         9        18
> Normal IPI:     176       343
> Broadcast IPI:    0      9885
> 
> And for v2:
> 
> Host, v4.15:                   
> Dry-run:          0         1
> Self-IPI:         9        18
> Normal IPI:      79       108
> Broadcast IPI:    0      2102
>                         
> Guest, v4.15-rc:
> Dry-run:          0         1
> Self-IPI:         9        18
> Normal IPI:     291       526
> Broadcast IPI:    0     10439
> 
> Guest, v4.15-rc + VHE:
> Dry-run:          0         2
> Self-IPI:        14        28
> Normal IPI:     370       569
> Broadcast IPI:    0     11688
> 
> All times are normalized to the v1 host dry-run time.  Smaller is better.
> 

Thanks for running this.

> Results for v1 and v2 may vary because the kernel version changed.
> What worries us is the slowdown in the "Normal IPI" test observed with
> the v2 series.

I'm wondering if this is not simply variability in your measurements.
How many times have you run this?  The 100,000 iterations for each run
is not a lot if you consider the cost of migrating threads.

Is this workload pinned to a single CPU?  Is the system otherwise idle
(both host and guest)?  If you run this during boot or during kernel
module load, the results may be skewed by that.

Power management can greatly influence results as well.

Just so I'm sure we're reading these results the same way: your "+ VHE"
notation means the VHE optimization series, but both the before and
after pictures run with VHE enabled, right?

Are you using the same guest kernel version and config for both your v1
and v2 results, and for both the before and after versions?

I can't easily come up with a scenario that explains the slowdown on the
normal IPI test, beyond some unfortunate bug introduced in v2.

> 
> Nevertheless, if you find the test relevant, for v1 and v2,
> Tested-by: Yury Norov <ynorov at caviumnetworks.com>

Thanks,
-Christoffer
