[PATCH v2 00/17] KVM/ARM: Guest Entry/Exit optimizations
Marc Zyngier
marc.zyngier at arm.com
Wed Feb 17 08:40:32 PST 2016
I've recently been looking at our entry/exit costs, and the profiling
figures showed some very low-hanging fruit.
The most obvious cost is that accessing the GIC HW is slow. As in
"deadly slow", especially when GICv2 is involved. So not hammering the
HW when there is nothing to write (or even to read) is immediately
beneficial, as this is the most common case (whatever people seem to
think, interrupts are a *rare* event). Similar work has also been done
for GICv3, with a reduced impact (it was less "bad" to start with).
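
To illustrate the trick, here is a minimal C sketch of the save side
(assumed names throughout: "live_lrs", the structure layout and the
function are simplified stand-ins, not the actual vgic-v2 patches):

	#include <linux/bitops.h>
	#include <linux/io.h>
	#include <linux/irqchip/arm-gic.h>
	#include <linux/types.h>

	/*
	 * Illustrative only: skip every GICH_* MMIO access on the save
	 * path when no List Register is in use, which is by far the
	 * common case.  live_lrs is assumed to be maintained when the
	 * LRs are populated on guest entry.
	 */
	struct vgic_v2_sketch {
		u32 vgic_vmcr;
		u32 vgic_elrsr;
		u32 vgic_lr[64];
		unsigned long live_lrs;	/* bitmap of LRs holding an interrupt */
	};

	static void save_vgic_v2_state(struct vgic_v2_sketch *cpu,
				       void __iomem *base)
	{
		int i;

		if (!cpu->live_lrs)	/* nothing in flight: */
			return;		/* don't touch the HW at all */

		cpu->vgic_vmcr = readl_relaxed(base + GICH_VMCR);
		cpu->vgic_elrsr = readl_relaxed(base + GICH_ELRSR0);

		for_each_set_bit(i, &cpu->live_lrs, BITS_PER_LONG) {
			cpu->vgic_lr[i] = readl_relaxed(base + GICH_LR0 + i * 4);
			writel_relaxed(0, base + GICH_LR0 + i * 4);
		}

		cpu->live_lrs = 0;
	}

The same "don't touch the HW unless we have to" logic applies to the
restore path, and (with sysreg accessors instead of MMIO) to GICv3.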
Another easy thing to fix is the way we handle trapped system
registers. We insist on (mostly) sorting them, yet we still perform a
linear search on each trap. We can switch to a binary search for free,
and get an immediate benefit (the PMU code, being extremely
trap-happy, profits the most from this).
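
For reference, the trap-time lookup becomes roughly the following (a
sketch built on the kernel's bsearch(); the encoding packing,
reg_to_index() and the field names are simplified here, not the exact
sys_regs.c code):

	#include <linux/bsearch.h>
	#include <linux/types.h>

	struct sys_reg_desc {
		u8 Op0, Op1, CRn, CRm, Op2;
		/* trap handler, reset logic, etc. elided */
	};

	/* Pack the encoding so two descriptors compare as plain integers. */
	static unsigned long reg_to_index(const struct sys_reg_desc *r)
	{
		return ((unsigned long)r->Op0 << 14) | (r->Op1 << 11) |
		       (r->CRn << 7) | (r->CRm << 3) | r->Op2;
	}

	static int cmp_sys_reg(const void *key, const void *elt)
	{
		unsigned long a = reg_to_index(key);
		unsigned long b = reg_to_index(elt);

		return (a < b) ? -1 : (a > b);
	}

	/* O(log n) per trap, as long as the table was sorted at init time. */
	static const struct sys_reg_desc *
	find_reg(const struct sys_reg_desc *key,
		 const struct sys_reg_desc *table, unsigned int num)
	{
		return bsearch(key, table, num, sizeof(*table), cmp_sys_reg);
	}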
With these in place, I see an improvement of 10 to 40% (depending on
the platform) in our world-switch cycle count when running a set of
hand-crafted guests designed to do nothing but trap.

Please note that VM exits are actually a rare event on ARM, so don't
expect your guest to be 40% faster overall; for real workloads, this
will hardly make a noticeable difference.
Methodology:
* NULL-hypercall guest: Perform 2^20 PSCI_0_2_FN_PSCI_VERSION calls,
and then a power-off:
__start:
	mov	x19, #(1 << 20)		// iteration count
1:	mov	x0, #0x84000000		// PSCI_0_2_FN_PSCI_VERSION
	hvc	#0
	sub	x19, x19, #1
	cbnz	x19, 1b
	mov	x0, #0x84000000
	add	x0, x0, #8		// PSCI_0_2_FN_SYSTEM_OFF
	hvc	#0
	b	.
* Self IPI guest: Inject and handle 2^20 SGI0 using GICv2 or GICv3,
and then power-off:
__start:
	mov	x19, #(1 << 20)		// iteration count
	mrs	x0, id_aa64pfr0_el1
	ubfx	x0, x0, #24, #4		// GIC field, bits [27:24]
	and	x0, x0, #0xf
	cbz	x0, do_v2		// no GIC system register interface
	mrs	x0, s3_0_c12_c12_5	// ICC_SRE_EL1
	and	x0, x0, #1		// SRE bit
	cbnz	x0, do_v3		// sysreg interface enabled
do_v2:
	mov	x0, #0x3fff0000		// Dist
	mov	x1, #0x3ffd0000		// CPU interface
	mov	w2, #1
	str	w2, [x0]		// Enable Group0
	ldr	w2, =0xa0a0a0a0
	str	w2, [x0, #0x400]	// 0xa0 priority for SGI0-3
	mov	w2, #0x0f
	str	w2, [x0, #0x100]	// Enable SGI0-3
	mov	w2, #0xf0
	str	w2, [x1, #4]		// PMR
	mov	w2, #1
	str	w2, [x1]		// Enable CPU interface
1:
	mov	w2, #(2 << 24)		// Interrupt self with SGI0
	str	w2, [x0, #0xf00]	// GICD_SGIR
2:	ldr	w2, [x1, #0x0c]		// GICC_IAR
	cmp	w2, #0x3ff		// spurious?
	b.ne	3f
	wfi
	b	2b
3:	str	w2, [x1, #0x10]		// GICC_EOIR
	sub	x19, x19, #1
	cbnz	x19, 1b
die:
	mov	x0, #0x84000000
	add	x0, x0, #8		// PSCI_0_2_FN_SYSTEM_OFF
	hvc	#0
	b	.
do_v3:
	mov	x0, #0x3fff0000		// Dist
	mov	x1, #0x3fbf0000		// Redist 0
	mov	x2, #0x10000
	add	x1, x1, x2		// SGI page
	mov	w2, #2
	str	w2, [x0]		// Enable Group1
	ldr	w2, =0xa0a0a0a0
	str	w2, [x1, #0x400]	// 0xa0 priority for SGI0-3
	mov	w2, #0x0f
	str	w2, [x1, #0x100]	// Enable SGI0-3
	mov	w2, #0xf0
	msr	s3_0_c4_c6_0, x2	// ICC_PMR_EL1
	mov	w2, #1
	msr	s3_0_c12_c12_7, x2	// ICC_IGRPEN1_EL1: enable Group1
1:
	mov	x2, #1
	msr	s3_0_c12_c11_5, x2	// ICC_SGI1R_EL1: self SGI0
2:	mrs	x2, s3_0_c12_c12_0	// ICC_IAR1_EL1
	cmp	w2, #0x3ff		// spurious?
	b.ne	3f
	wfi
	b	2b
3:	msr	s3_0_c12_c12_1, x2	// ICC_EOIR1_EL1
	sub	x19, x19, #1
	cbnz	x19, 1b
	b	die
* sysreg trap guest: Perform 2^20 PMSELR_EL0 accesses, and power-off:
__start:
	mov	x19, #(1 << 20)		// iteration count
1:	mrs	x0, PMSELR_EL0		// traps to EL2
	sub	x19, x19, #1
	cbnz	x19, 1b
	mov	x0, #0x84000000
	add	x0, x0, #8		// PSCI_0_2_FN_SYSTEM_OFF
	hvc	#0
	b	.
* These guests are profiled using perf and kvmtool:
taskset -c 1 perf stat -e cycles:kh lkvm run -c1 --kernel do_sysreg.bin 2>&1 >/dev/null| grep cycles
The result is then divided by the number of iterations (2^20).
These tests have been run on three different platforms (two GICv2
based, and one with GICv3 and legacy mode) and have shown significant
improvements in all cases. I've only touched the arm64 GIC code, but
obviously the 32bit code should use it as well once we've migrated it
to C.
Vanilla v4.5-rc4 (cycles per iteration):

            A        B        C-v2     C-v3
Null HVC:   8462     6566     6572     6505
Self SGI:   11961    8690     9541     8629
SysReg:     8952     6979     7212     7180

Patched v4.5-rc4 (cycles per iteration, delta vs. vanilla):

            A            B            C-v2         C-v3
Null HVC:   5219 -38%    3957 -39%    5175 -21%    5158 -20%
Self SGI:   8946 -25%    6658 -23%    8547 -10%    7299 -15%
SysReg:     5314 -40%    4190 -40%    5417 -25%    5414 -24%
I've pushed out a branch (kvm-arm64/suck-less) to the usual location,
based on -rc4 + a few fixes I also posted today.
Thanks,
M.
* From v1:
- Fixed a nasty bug dealing with the active Priority Register
- Maintenance interrupt lazy saving
- More LR hackery
- Adapted most of the series for GICv3 as well
Marc Zyngier (17):
arm64: KVM: Switch the sys_reg search to be a binary search
ARM: KVM: Properly sort the invariant table
ARM: KVM: Enforce sorting of all CP tables
ARM: KVM: Rename struct coproc_reg::is_64 to is_64bit
ARM: KVM: Switch the CP reg search to be a binary search
KVM: arm/arm64: timer: Add active state caching
arm64: KVM: vgic-v2: Avoid accessing GICH registers
arm64: KVM: vgic-v2: Save maintenance interrupt state only if required
arm64: KVM: vgic-v2: Move GICH_ELRSR saving to its own function
arm64: KVM: vgic-v2: Do not save an LR known to be empty
arm64: KVM: vgic-v2: Only wipe LRs on vcpu exit
arm64: KVM: vgic-v2: Make GICD_SGIR quicker to hit
arm64: KVM: vgic-v3: Avoid accessing ICH registers
arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
arm64: KVM: vgic-v3: Do not save an LR known to be empty
arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
arm64: KVM: vgic-v3: Do not save ICH_AP0Rn_EL2 for GICv2 emulation
arch/arm/kvm/arm.c | 1 +
arch/arm/kvm/coproc.c | 74 +++++----
arch/arm/kvm/coproc.h | 8 +-
arch/arm64/kvm/hyp/vgic-v2-sr.c | 144 +++++++++++++----
arch/arm64/kvm/hyp/vgic-v3-sr.c | 333 ++++++++++++++++++++++++++--------------
arch/arm64/kvm/sys_regs.c | 40 ++---
include/kvm/arm_arch_timer.h | 5 +
include/kvm/arm_vgic.h | 8 +-
virt/kvm/arm/arch_timer.c | 31 ++++
virt/kvm/arm/vgic-v2-emul.c | 10 +-
virt/kvm/arm/vgic-v3.c | 4 +-
11 files changed, 452 insertions(+), 206 deletions(-)
--
2.1.4