[RFC PATCH 4/4] KVM: arm64: Introduce KVM_CAP_ARM_PROTECTED_VM

Thu Jun 3 13:15:55 PDT 2021

On Thu, Jun 03, 2021, Will Deacon wrote:
> +Enabling this capability causes all memory slots of the specified VM to be
> +unmapped from the host system and put into a state where they are no longer
> +configurable.

What happens if userspace remaps the hva of the memslot?  In other words, how
will the implementation ensure the physical memory backing the guest is truly
inaccessible?

And how will the guest/host share memory?  E.g. what if the guest wants to set
up a new shared memory region and the host wants that to happen in a new memslot?

On a related topic, would the concept of guest-only memory be useful[*]?  The
idea is to require userspace to map guest-private memory with a dedicated VMA
flag.  That will allow the host kernel to prevent userspace (or the kernel) from
mapping guest-private memory, and will also allow KVM to verify that memory that
the guest wants to be private is indeed private.

[*] https://lkml.kernel.org/r/20210416154106.23721-14-kirill.shutemov@linux.intel.com

> The memory slot corresponding to the ID passed in args[0] is
> +populated with the guest firmware image provided by the host firmware.

Why take a slot instead of an address range?  I assume copying the host firmware
into guest memory will utilize existing APIs, i.e. the memslot lookups to resolve
the GPA->HVA will Just Work (TM).

> +The first vCPU to enter the guest is defined to be the primary vCPU. All other
> +vCPUs belonging to the VM are secondary vCPUs.
> +
> +All vCPUs belonging to a VM with this capability enabled are initialised to a
> +pre-determined reset state irrespective of any prior configuration according to
> +the KVM_ARM_VCPU_INIT ioctl, with the following exceptions for the primary
> +vCPU:
> +
> +	===========	===========
> +	Register(s)	Reset value
> +	===========	===========
> +	X0-X14:		Preserved (see KVM_SET_ONE_REG)

What's the intended use case for allowing userspace to set a pile of registers?
Giving userspace control of vCPU state is tricky because the state is untrusted.
Is the state going to be attested in any way, or is the VMM trusted while the
VM is being created and only de-privileged later on?

> +	X15:		Boot protocol version (0)
> +	X16-X30:	Reserved (0)
> +	PC:		IPA base of firmware memory slot
> +	SP:		IPA end of firmware memory slot
> +	===========	===========
> +

...

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c5d1f3c87dbd..e1d4a87d18e4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1349,6 +1349,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  	bool writable = !(mem->flags & KVM_MEM_READONLY);
>  	int ret = 0;
>  
> +	if (kvm_vm_is_protected(kvm))
> +		return -EPERM;

FWIW, this will miss the benign corner case where KVM_SET_USER_MEMORY_REGION
does not actualy change anything about an existing region.

A more practical problem is that this check won't happen until KVM has marked
the existing region as invalid in the DELETE or MOVE case.  So userspace can't
completely delete a region, but it can temporarily delete a region by briefly
making it in invalid.  No idea what the ramifications of that are on arm64.

>  	if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
>  			change != KVM_MR_FLAGS_ONLY)
>  		return 0;

...

> +static int pkvm_vm_ioctl_enable(struct kvm *kvm, u64 slotid)
> +{
> +	int ret = 0;

Deep into the nits: technically unnecessary since ret is guaranteed to be written
before being consumed.

> +	mutex_lock(&kvm->lock);
> +	if (kvm_vm_is_protected(kvm)) {
> +		ret = -EPERM;
> +		goto out_kvm_unlock;
> +	}
> +
> +	mutex_lock(&kvm->slots_lock);
> +	ret = pkvm_enable(kvm, slotid);
> +	if (ret)
> +		goto out_slots_unlock;
> +
> +	kvm->arch.pkvm.enabled = true;
> +out_slots_unlock:
> +	mutex_unlock(&kvm->slots_lock);
> +out_kvm_unlock:
> +	mutex_unlock(&kvm->lock);
> +	return ret;
> +}

...

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 3fd9a7e9d90c..58ab8508be5e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_SGX_ATTRIBUTE 196
>  #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
>  #define KVM_CAP_PTP_KVM 198
> +#define KVM_CAP_ARM_PROTECTED_VM 199

Rather than a dedicated (and arm-only) capability for protected VMs, what about
adding a capability to retrieve the supported VM types?  And obviously make
protected VMs a different type that must be specified at VM creation (there's
already plumbing for this).  Even if there's no current need to know at creation
time that a VM will be protected, it's also unlikely to be a burden on userspace
and could be nice to have down the road.  The late "activation" ioctl() can still
be supported, e.g. to start disallowing memslot updates.

x86 with TDX would look like:

        case KVM_CAP_VM_TYPES:
                r = BIT(KVM_X86_LEGACY_VM);
                if (kvm_x86_ops.is_vm_type_supported(KVM_X86_TDX_VM))
                        r |= BIT(KVM_X86_TDX_VM);
                break;

and arm64 would look like?

	case KVM_CAP_VM_TYPES:
		r = BIT(KVM_ARM64_LEGACY_VM);
		if (kvm_get_mode() == KVM_MODE_PROTECTED)
			r |= BIT(KVM_ARM64_PROTECTED_VM);
		break;

>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> -- 
> 2.32.0.rc0.204.g9fa02ecfa5-goog
>