[RFC PATCH 4/4] arm64/sve: KVM: Basic SVE support
Christoffer Dall
cdall at linaro.org
Thu Nov 23 10:49:29 PST 2017
On Thu, Nov 23, 2017 at 06:06:34PM +0000, Dave Martin wrote:
> On Wed, Nov 22, 2017 at 08:23:27PM +0100, Christoffer Dall wrote:
> > On Fri, Nov 17, 2017 at 04:38:55PM +0000, Dave Martin wrote:
> > > This patch is a flattened bunch of hacks for adding SVE support to
> > > KVM. It's intended as a starting point for comments: it is not
> > > intended to be complete or final!
> > >
> > > ** This patch has suspected bugs and has undergone minimal testing: do
> > > not merge **
> > >
> > > Notes:
> > >
> > > struct kvm_vcpu_arch does not currently include space for a guest's
> > > SVE state, so supporting SVE in guests requires some additional
> > > space to be allocated. Statically allocating space per-vcpu is
> > > wasteful, especially if this allocation is based on the theoretical
> > > future arch maximum vector length SVE_VL_MAX.
> > >
> > > A pointer to dynamically allocated memory would require that memory
> > > to be mapped into hyp. Hyp mappings cannot currently be torn down
> > > dynamically, so this would result in a mess of kernel heap memory
> > > getting mapped into hyp over time.
> > >
> > > This patch adopts a compromise: enough space is allocated at the
> > > end of each kvm_vcpu to store the SVE state, sized according to the
> > > maximum vector length supported by the hardware. Naturally, if the
> > > hardware does not support SVE, no extra space is allocated at all.
> > >
> > > Context switching is implemented by adding alternative SVE code at
> > > each site where FPSIMD context is handled. SVE is unconditionally
> > > provided to the guest if the host supports it. This is a bit
> > > crude, but good enough for a proof-of-concept implementation.
> > >
> > > ZCR_EL1 and ZCR_EL2 are added to the sys_regs list unconditionally,
> > > which will break userspace snapshot/restore compatibility.
> > > Possibly a more flexible approach is needed. The inclusion of
> > > ZCR_EL2 here is a bit odd too: this is a feature configuration
> > > control rather than a guest register -- it is used to clamp the
> > > maximum vector length available to the guest. Currently it is just
> > > set by default to correspond to the host's maximum.
> > >
> > > Questions
> > > ---------
> > >
> > > * Should userspace be able to control the maximum SVE vector
> > > length available to the guest, and what's the most natural way
> > > to do it?
> >
> > Yes, we should do this. I think adding an attribute to the VCPU device
> > is the most natural way to do this, see:
> >
> > Documentation/virtual/kvm/devices/vcpu.txt
>
> I'll take a look.
>
> > That way, you could read the maximum supported width and set the
> > requested width.
>
> Is there an agreed policy on whether KVM supports heterogeneous compute
> clusters?
Let me put it this way: I don't think we want to make any decisions now
that prevent us from supporting heterogeneous clusters in the future.
This is something which is supported in part on x86, and I don't think
ARM should do any less. I can't promise that every decision we've made
so far is compatible with that, but we do plan to support it in the
future.
>
> If a vcpu is never expected to run on a node different from the one it
> was created on, the ability to limit the max vector length available
> to the guest may not be useful.
>
Even in that case, it could be useful for debug, test, and development.
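
To make the attribute suggestion concrete, the userspace side could
look roughly like the sketch below. KVM_ARM_VCPU_SVE_CTRL and
KVM_ARM_VCPU_SVE_MAX_VL are made-up names here; only the
KVM_{GET,SET}_DEVICE_ATTR ioctls themselves exist today:

        #include <err.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Negotiate an SVE vector length before the vcpu first runs */
        static __u64 negotiate_sve_vl(int vcpu_fd, __u64 requested_vl)
        {
                __u64 vl;
                struct kvm_device_attr attr = {
                        .group = KVM_ARM_VCPU_SVE_CTRL,   /* made up */
                        .attr  = KVM_ARM_VCPU_SVE_MAX_VL, /* made up */
                        .addr  = (__u64)(unsigned long)&vl,
                };

                /* Read the widest vector length the host can offer */
                if (ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr))
                        err(1, "KVM_GET_DEVICE_ATTR");

                if (requested_vl < vl)
                        vl = requested_vl;

                /* Request the (possibly smaller) limit before KVM_RUN */
                if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
                        err(1, "KVM_SET_DEVICE_ATTR");

                return vl;
        }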
> > >
> > > For example, it would be necessary to limit the vector length to
> > > the lowest common denominator in order to support migration
> > > across a cluster where the maximum hardware vector length
> > > differs between nodes.
> > >
> > > * Combined alternatives are really awkward. Is there any way to
> > > use the standard static-key based feature tests in hyp?
> >
> > Look at has_vhe(), that's a static key that we can use in hyp, and there
> > were some lengthy discussions about why that was true in the past. We
> > also use the vgic_v3_cpuif_trap etc. static keys, so take a look at what
> > we do for those.
>
> OK, I'll take a look.
>
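(For reference, this is the pattern I mean. It is safe in hyp code once
the cpu capabilities have been finalized, because it compiles down to a
patched branch rather than a memory access -- sketch:

        if (cpus_have_const_cap(ARM64_SVE)) {
                /* SVE-aware save/restore path */
        } else {
                /* plain FPSIMD path */
        }

has_vhe() and system_supports_sve() are thin wrappers around the same
mechanism.)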
> > > TODO
> > > ----
> > >
> > > * Allow userspace feature control, to choose whether to expose SVE
> > > to a guest.
> > >
> > > * Attempt to port some of the KVM entry code to C, at least for the
> > > __fpsimd_guest_restore stuff. The extra complexity with SVE looks
> > > unsustainable.
> >
> > This sounds like a good idea.
>
> Dang. ;)
>
> > >
> > > * Figure out ABI for exposing SVE regs via the ioctl interface.
> > >
> > > *Bugs*
> > > ------
> > >
> > > Currently there is nothing stopping KVM userspace from
> > > changing the guest's ZCR_EL2 after boot via the ioctl interface:
> > > this breaks architectural assumptions in the guest, and should
> > > really be forbidden. Also, this is a latent trigger for
> > > buffer overruns, if creation of guests with limited VL is
> > > someday permitted.
> >
> > Userspace shouldn't be able to directly control ZCR_EL2, only ZCR_EL1.
> > The length constraint for ZCR_EL2 should be exported via some more
> > generic interface (see above) which prevents you from modifying it after
> > a vcpu has run (we have other similar uses, see
> > vcpu->arch.has_run_once), but ZCR_EL1 should be exported through the
> > normal KVM_SET/GET_ONE_REG IOCTL interface, and similar to all other
> > registers, if userspace tampers with it while the guest is running, it's
> > not going to be pretty for the guest, but shouldn't affect the host.
>
> Agreed: putting ZCR_EL2 here is an ugly hack, but I think you've
> furnished me with better ways of doing this now.
>
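Concretely, I'd expect the sys_regs table to grow an entry along these
lines, where access_zcr_el1/reset_zcr_el1 are hypothetical helpers and
the EL2 length clamp lives outside the table entirely:

        /* arch/arm64/kvm/sys_regs.c -- sketch only */
        { SYS_DESC(SYS_ZCR_EL1), access_zcr_el1, reset_zcr_el1, ZCR_EL1 },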
> > >
> > > Signed-off-by: Dave Martin <Dave.Martin at arm.com>
> > > ---
> > > arch/arm64/include/asm/fpsimdmacros.h | 8 +++++
> > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++++++++++++
> > > arch/arm64/include/asm/kvm_hyp.h | 4 +++
> > > arch/arm64/include/asm/sysreg.h | 1 +
> > > arch/arm64/kernel/asm-offsets.c | 8 +++++
> > > arch/arm64/kernel/entry-fpsimd.S | 1 -
> > > arch/arm64/kvm/handle_exit.c | 2 +-
> > > arch/arm64/kvm/hyp/entry.S | 60 ++++++++++++++++++++++++++++-------
> > > arch/arm64/kvm/hyp/fpsimd.S | 12 +++++++
> > > arch/arm64/kvm/hyp/hyp-entry.S | 7 ++++
> > > arch/arm64/kvm/hyp/switch.c | 46 ++++++++++++++++++++++++++-
> > > arch/arm64/kvm/reset.c | 18 +++++++++++
> > > arch/arm64/kvm/sys_regs.c | 39 ++++++++++++++++-------
> > > virt/kvm/arm/arm.c | 12 +++++--
> > > 14 files changed, 221 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h
> > > index e050d76..e2124c8 100644
> > > --- a/arch/arm64/include/asm/fpsimdmacros.h
> > > +++ b/arch/arm64/include/asm/fpsimdmacros.h
> > > @@ -17,6 +17,12 @@
> > > * along with this program. If not, see <http://www.gnu.org/licenses/>.
> > > */
> > >
> > > +#ifndef __ARM64_FPSIMDMACROS_H
> > > +#define __ARM64_FPSIMDMACROS_H
> > > +
> > > +#include <asm/assembler.h>
> > > +#include <asm/sysreg.h>
> > > +
> > > .macro fpsimd_save state, tmpnr
> > > stp q0, q1, [\state, #16 * 0]
> > > stp q2, q3, [\state, #16 * 2]
> > > @@ -223,3 +229,5 @@
> > > ldr w\nxtmp, [\xpfpsr, #4]
> > > msr fpcr, x\nxtmp
> > > .endm
> > > +
> > > +#endif /* ! __ARM64_FPSIMDMACROS_H */
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 674912d..7045682 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -22,6 +22,7 @@
> > > #ifndef __ARM64_KVM_HOST_H__
> > > #define __ARM64_KVM_HOST_H__
> > >
> > > +#include <linux/stddef.h>
> > > #include <linux/types.h>
> > > #include <linux/kvm_types.h>
> > > #include <asm/cpufeature.h>
> > > @@ -29,6 +30,7 @@
> > > #include <asm/kvm.h>
> > > #include <asm/kvm_asm.h>
> > > #include <asm/kvm_mmio.h>
> > > +#include <asm/sigcontext.h>
> > >
> > > #define __KVM_HAVE_ARCH_INTC_INITIALIZED
> > >
> > > @@ -102,6 +104,8 @@ enum vcpu_sysreg {
> > > SCTLR_EL1, /* System Control Register */
> > > ACTLR_EL1, /* Auxiliary Control Register */
> > > CPACR_EL1, /* Coprocessor Access Control */
> > > + ZCR_EL1, /* SVE Control */
> > > + ZCR_EL2, /* SVE Control (enforced by host) */
> >
> > eh, this shouldn't be here (I don't suppose you plan on supporting
> > nested virtualization of SVE just yet?), but should be
> > described in the vcpu structure outside of the vcpu_sysreg array, which
> > really only contains guest-visible values.
>
> Agreed, see above.
>
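In other words, something like this in struct kvm_vcpu_arch, with an
invented field name, rather than a ZCR_EL2 slot in the sysreg array:

        /* Host-enforced SVE vector length limit: fixed before the vcpu
         * first runs, never visible to the guest as a register */
        u64 sve_max_vl;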
> >
> > > TTBR0_EL1, /* Translation Table Base Register 0 */
> > > TTBR1_EL1, /* Translation Table Base Register 1 */
> > > TCR_EL1, /* Translation Control Register */
> > > @@ -202,6 +206,7 @@ struct kvm_vcpu_arch {
> > > /* HYP configuration */
> > > u64 hcr_el2;
> > > u32 mdcr_el2;
> > > + u32 sve_ffr_offset; /* offset of stored SVE FFR from ctxt base */
> > >
> > > /* Exception Information */
> > > struct kvm_vcpu_fault_info fault;
> > > @@ -279,6 +284,31 @@ struct kvm_vcpu_arch {
> > > bool has_run_once;
> > > };
> > >
> > > +/*
> > > + * The size of SVE state is not known at compile time, so these helper
> > > + * macros are used to address state appended to the end of struct
> > > + * kvm_vcpu.
> > > + *
> > > + * There is currently no host SVE state, since a vcpu must run inside
> > > + * syscall context, and any cached SVE state of a second thread is
> >
> > second thread?
>
> "an unrelated host task"?
>
right, ok.
> > > + * explicitly flushed before vcpu entry.
> >
> > didn't you just change that in the previous patch to be after vcpu exit?
>
> Err, yes, that should say "after".
>
>
> [...]
>
> > > diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> > > index 12ee62d..d8e8d22 100644
> > > --- a/arch/arm64/kvm/hyp/entry.S
> > > +++ b/arch/arm64/kvm/hyp/entry.S
> > > @@ -162,37 +162,75 @@ ENTRY(__fpsimd_guest_restore)
> > > stp x2, x3, [sp, #-16]!
> > > stp x4, lr, [sp, #-16]!
> > >
> > > + mrs x0, cptr_el2
> > > +
> > > +alternative_if_not ARM64_SVE
> > > + mov x1, #CPTR_EL2_TFP
> > > + mov x2, #CPACR_EL1_FPEN
> > > +alternative_else
> > > + mov x1, #CPTR_EL2_TFP | CPTR_EL2_TZ
> > > + mov x2, #CPACR_EL1_FPEN | CPACR_EL1_ZEN
> > > +alternative_endif
> > > +
> > > alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
> > > - mrs x2, cptr_el2
> > > - bic x2, x2, #CPTR_EL2_TFP
> > > - msr cptr_el2, x2
> > > + bic x0, x0, x1
> > > alternative_else
> > > - mrs x2, cpacr_el1
> > > - orr x2, x2, #CPACR_EL1_FPEN
> > > - msr cpacr_el1, x2
> > > + orr x0, x0, x2
> > > alternative_endif
> > > +
> > > + msr cptr_el2, x0
> > > isb
> > >
> > > - mrs x3, tpidr_el2
> > > + mrs x4, tpidr_el2
> > >
> > > - ldr x0, [x3, #VCPU_HOST_CONTEXT]
> > > + ldr x0, [x4, #VCPU_HOST_CONTEXT]
> > > kern_hyp_va x0
> > > add x0, x0, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
> > > bl __fpsimd_save_state
> > >
> > > - add x2, x3, #VCPU_CONTEXT
> > > + add x2, x4, #VCPU_CONTEXT
> > > +
> > > +#ifdef CONFIG_ARM64_SVE
> > > +alternative_if ARM64_SVE
> > > + b 2f
> > > +alternative_else_nop_endif
> > > +#endif
> > > +
> > > add x0, x2, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
> > > bl __fpsimd_restore_state
> > >
> > > +0:
> > > // Skip restoring fpexc32 for AArch64 guests
> > > mrs x1, hcr_el2
> > > tbnz x1, #HCR_RW_SHIFT, 1f
> > > - ldr x4, [x3, #VCPU_FPEXC32_EL2]
> > > - msr fpexc32_el2, x4
> > > + ldr x3, [x4, #VCPU_FPEXC32_EL2]
> > > + msr fpexc32_el2, x3
> > > 1:
> > > ldp x4, lr, [sp], #16
> > > ldp x2, x3, [sp], #16
> > > ldp x0, x1, [sp], #16
> > >
> > > eret
> > > +
> > > +#ifdef CONFIG_ARM64_SVE
> > > +2:
> > > +alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
> > > + ldr x0, [x4, #VCPU_ZCR_EL2]
> > > + ldr x3, [x4, #VCPU_ZCR_EL1]
> > > + msr_s SYS_ZCR_EL2, x0
> > > +alternative_else
> > > + ldr x0, [x4, #VCPU_ZCR_EL1]
> > > + ldr x3, [x4, #VCPU_ZCR_EL2]
> > > + msr_s SYS_ZCR_EL12, x0
> > > +alternative_endif
> > > +
> > > + ldr w0, [x4, #VCPU_SVE_FFR_OFFSET]
> > > + add x0, x4, x0
> > > + add x1, x2, #VCPU_FPSR - VCPU_CONTEXT
> > > + mov x2, x3
> > > + bl __sve_load_state
> > > +
> > > + b 0b
> > > +#endif /* CONFIG_ARM64_SVE */
> >
> > I think if we could move all of this out to C code that would be much
> > nicer. Also, we currently do the short-circuit fault-in-fpsimd state
> > inside hyp and always save the fpsimd state when returning to the host
> > kernel, but in my optimization series I change this to leave the state
> > in hardware until we schedule or return to userspace, so perhaps we can
> > go all the way back to handle_exit() then without much performance loss,
> > which may further simplify the SVE support?
>
> This is what I would be aiming for, coupled with awareness of whether
> the host has any live FPSIMD/SVE state in the first place: for SVE
> specifically the answer will _usually_ be "no".
>
> Ideally, we would give a vcpu its own SVE/FPSIMD context on a par with
> host tasks and just reuse the fpsimd.c machinery rather than inventing
> separate machinery for KVM -- the flipside would be that the hyp code
> may need to manipulate thread flags etc. -- I didn't want to rush to
> this destination before learning to walk though.
Fair enough, but then I think the "let's rely on VHE for SVE support"
argument comes in, because with VHE you can mostly think of hyp code as
being host kernel code.
>
> For the first iteration of KVM SVE support I would rather not do that
> refactoring, to reduce the number of things I'm breaking at the same
> time...
>
On the other hand, if it makes the SVE support code much simpler, it may
be easier in the end. It may be worth thinking of a way we can measure
the two approaches with basic FPSIMD so that we know the consequences of
refactoring anything.
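
Roughly, the direction of my optimization series is the sketch below;
guest_fpsimd_loaded is an invented flag name here:

        /* On exit, leave the guest's FP/SIMD state in the hardware and
         * only write it back when the host actually needs the registers,
         * i.e. on schedule-out or on return to userspace. */
        if (vcpu->arch.guest_fpsimd_loaded) {
                __fpsimd_save_state(&vcpu->arch.ctxt.gp_regs.fp_regs);
                vcpu->arch.guest_fpsimd_loaded = false;
        }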
> >
> > The hit would obviously be a slightly higher latency on the first fault
> > in the guest on SVE/fpsimd.
>
> For SVE it's probably not the end of the world, since that state is
> higher cost to save/restore in the first place.
>
> For FPSIMD I've less of a feel for it.
>
Yeah, me neither. We'd have to measure something.
> [...]
>
> > > +static hyp_alternate_select(__fpsimd_guest_save,
> > > + __fpsimd_guest_save_nsve, __fpsimd_guest_save_sve,
> > > + ARM64_SVE);
> > > +
> >
> > How likely is it that we'll have SVE implementations without VHE? If
> > completely unlikely, perhaps it's worth considering just disabling SVE
> > on non-VHE systems.
>
> I did wonder about this, but the architectural position is that such
> a configuration is allowed.
>
> I think that most SVE implementations will have VHE in practice, so
> non-VHE might be supported as a later addition if supporting it is
> invasive. But I'll see what it looks like after cleanup in line
> with your other suggestions.
Ok, sounds good.
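
If non-VHE does get deferred, the gate could be as simple as this
sketch (invented helper name):

        /* Only offer SVE to guests when VHE is in use */
        static bool kvm_arm_support_guest_sve(void)
        {
                return system_supports_sve() && has_vhe();
        }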
>
> [...]
>
> > > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > > index 3256b92..b105e54 100644
> > > --- a/arch/arm64/kvm/reset.c
> > > +++ b/arch/arm64/kvm/reset.c
> > > @@ -26,7 +26,9 @@
> > >
> > > #include <kvm/arm_arch_timer.h>
> > >
> > > +#include <asm/cpufeature.h>
> > > #include <asm/cputype.h>
> > > +#include <asm/fpsimd.h>
> > > #include <asm/ptrace.h>
> > > #include <asm/kvm_arm.h>
> > > #include <asm/kvm_asm.h>
> > > @@ -88,6 +90,22 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
> > > return r;
> > > }
> > >
> > > +size_t arch_kvm_vcpu_size(void)
> > > +{
> > > + unsigned int vq;
> > > + struct kvm_vcpu vcpu; /* dummy struct used for size calculation */
> > > + char *p;
> > > +
> > > + if (!system_supports_sve())
> > > + return sizeof(struct kvm_vcpu);
> > > +
> > > + BUG_ON(!sve_vl_valid(sve_max_vl));
> > > + vq = sve_vq_from_vl(sve_max_vl);
> > > +
> > > + p = vcpu_guest_sve_state(&vcpu) + vcpu_guest_sve_size(vq);
> > > + return p - (char *)&vcpu;
> >
> > Why are we doing things this way as opposed to simply having a pointer
> > to SVE state in the VCPU structure which we allocate as a separate
> > structure when setting up a VM (and we can then choose to only allocate
> > this memory if SVE is actually supported for the guest) ?
> >
> > This all seems very counter-intuitive.
>
> I disliked the idea of mapping random kernel heap memory into hyp,
> but I guess we already do that, since vcpu kmem cache memory may
> get recycled for other purposes.
>
> I guess I was trying to dodge the need to understand anything
> about the hyp memory mapping, but as you can see here I failed
> anyway.
>
>
> Do you consider mapping random kernel heap into hyp acceptable?
Yes, that's pretty much what we do for the kvm_cpu_context_t per CPU, I
think.
See how we use create_hyp_mappings().
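i.e. vcpu init could allocate and share the state roughly like this
sketch, reusing your vcpu_guest_sve_size() with an invented sve_state
pointer in kvm_vcpu_arch:

        /* Allocate guest SVE state separately, then map it into hyp
         * like any other per-vcpu data */
        vcpu->arch.sve_state = kzalloc(vcpu_guest_sve_size(vq), GFP_KERNEL);
        if (!vcpu->arch.sve_state)
                return -ENOMEM;

        ret = create_hyp_mappings(vcpu->arch.sve_state,
                                  vcpu->arch.sve_state +
                                          vcpu_guest_sve_size(vq),
                                  PAGE_HYP);
        if (ret)
                goto out_free_sve;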
> Ideally we would refcount mappings and remove any pages that
> are no longer needed, but that's extra effort and may be best
> dealt with separately.
>
We have so far taken the policy that we don't bother. Yes, some extra
things may be mapped from hyp code, but everything is mapped from the
host anyway, so I don't see a strong security hole here.
If we happen to reuse some VA space for something else later on, we
simply "overwrite" the mapping as needed.
> > > +}
> > > +
> > > /**
> > > * kvm_reset_vcpu - sets core registers and sys_regs to reset value
> > > * @vcpu: The VCPU pointer
> > > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> > > index 1830ebc..fb86907 100644
> > > --- a/arch/arm64/kvm/sys_regs.c
> > > +++ b/arch/arm64/kvm/sys_regs.c
> > > @@ -481,6 +481,29 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
> > > vcpu_sys_reg(vcpu, PMCR_EL0) = val;
> > > }
> > >
> > > +static void reset_zcr_el2(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
> > > +{
> > > + u64 val = 0;
> > > + unsigned int vq;
> > > + char *p;
> > > +
> > > + if (system_supports_sve()) {
> > > + BUG_ON(!sve_vl_valid(sve_max_vl));
> >
> > This seems to be checked in more than one place in this patch. Is this not
> > something that the kernel can verify somewhere else at boot as opposed
> > to all over the place at run time?
>
> Treat this as debug. It should mostly go away.
>
> I wanted to sanity-check that SVE is fully probed in the host kernel
> before we can get here. I think this is probably true by construction,
> but for development purposes I'm being more paranoid.
>
ok, fair enough. I'll try to be less nit-picky.
> >
> > > + vq = sve_vq_from_vl(sve_max_vl);
> > > + val = vq - 1;
> > > +
> > > + /*
> > > + * It's a bit gross to do this here. But sve_ffr_offset is
> > > + * essentially a cached form of ZCR_EL2 for use by the hyp asm
> > > + * code, and I don't want to split the logic:
> > > + */
> >
> > This comment will be more concerning to people reading the code than
> > helpful I think.
> >
> > IIUC, the point is that we can't know where the FFRs are because it's
> > all dependent on the vector length, so the place where we initialize the
> > vector length is where we set the pointer, right?
>
> Yes-ish.
>
> > That makes good sense, and thus makes it non-gross, but I wonder how
> > this looks once we've parameterized the SVE support.
>
> This code is really a proof-of-concept hack. What I really meant
> here is that I don't like the way the code is currently factored --
> I expect to substantially rewrite it before a non-RFC posting.
>
> > Essentially, I think you're trying to initialize the per-VCPU KVM support
> > for SVE from within a framework designed to initialize the VCPU's state,
> > so I would expect that you have reset_zcr_el1 here, looking quite
> > different, and no reset_zcr_el2 until you support nested virt with SVE
> > for the guest hypervisor, as per my comment on the data structure.
> >
> > Also, if we can have a structure for the guest SVE state with a pointer
> > from the vcpu struct, the FFR pointer can be embedded inside the struct,
> > making it slightly more contained.
>
> Yes, that may work. Or I may find a cleaner way to deduce the pointer
> at the appropriate time rather than having to cache an offset.
>
> The real cause of this is that the code to calculate the offset is C
> but I need to use it from asm. If I port the affected parts of the
> hyp entry code to C anyway this problem probably disappears.
>
Sounds good.
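
For what it's worth, once that path is in C, the cached offset should
reduce to something like this sketch, built on your
vcpu_guest_sve_state() helper plus the uapi sigcontext layout macros:

        /* Locate FFR within the SVE state block appended to the vcpu */
        static void *vcpu_sve_ffr(struct kvm_vcpu *vcpu, unsigned int vq)
        {
                return vcpu_guest_sve_state(vcpu) +
                        SVE_SIG_FFR_OFFSET(vq) - SVE_SIG_REGS_OFFSET;
        }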
Thanks,
-Christoffer