[RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE

Wed Dec 2 11:35:03 EST 2020

Hi Haibo,

On 11/5/20 10:10 AM, Haibo Xu wrote:
> On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei at arm.com> wrote:
>> Stage 2 faults triggered by the profiling buffer attempting to write to
>> memory are reported by the SPE hardware by asserting a buffer management
>> event interrupt. Interrupts are by their nature asynchronous, which means
>> that the guest might have changed its stage 1 translation tables since the
>> attempted write. SPE reports the guest virtual address that caused the data
>> abort, but not the IPA, which means that KVM would have to walk the guest's
>> stage 1 tables to find the IPA; using the AT instruction to walk the
>> guest's tables in hardware is not an option because it doesn't report the
>> IPA in the case of a stage 2 fault on a stage 1 table walk.
>>
>> Fix both problems by pre-mapping the guest's memory at stage 2 with write
>> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
>> back the guest's memory, pinning the pages in memory, then tells KVM to map
>> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
>> with the attribute KVM_ARM_VM_SPE_FINALIZE. KVM will map all writable VMAs
>> which have the VM_LOCKED flag set. Hugetlb VMAs are practically pinned in
>> memory after they are faulted in and mlock() doesn't set the VM_LOCKED
>> flag, and just faults the pages in; KVM will treat hugetlb VMAs like they
>> have the VM_LOCKED flag and will also map them, faulting them in if
>> necessary, when handling the ioctl.
>>
>> VM live migration relies on a bitmap of dirty pages. This bitmap is created
>> by write-protecting a memslot and updating it as KVM handles stage 2 write
>> faults. Because KVM cannot handle stage 2 faults reported by the profiling
>> buffer, it will not pre-map a logging memslot. This effectively means that
>> profiling is not available when the VM is configured for live migration.
>>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei at arm.com>
>> ---
>> [..]
> It seems that the function below is used to de-finalize the SPE status,
> if I understand it correctly.
> How about renaming the function to something like "kvm_arm_vcpu_init_spe_definalize()"?

I don't have a strong opinion about the name and I'll keep your suggestion in mind
for the next iteration. The series is an RFC and the function might not even be
there in the final version.

>
>> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
>> +{
>> +       vcpu->kvm->arch.spe.finalized = false;
>> +}
>> +
>>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>>  {
>>         if (!vcpu_has_spe(vcpu))
>> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>         return -ENXIO;
>>  }
>>
>> +static int kvm_arm_spe_finalize(struct kvm *kvm)
>> +{
>> +       struct kvm_memory_slot *memslot;
>> +       enum kvm_pgtable_prot prot;
>> +       struct kvm_vcpu *vcpu;
>> +       int i, ret;
>> +
>> +       kvm_for_each_vcpu(i, vcpu, kvm) {
>> +               if (!kvm_arm_spe_vcpu_initialized(vcpu))
>> +                       return -ENXIO;
>> +       }
>> +
>> +       mutex_unlock(&kvm->slots_lock);
> Should be mutex_lock(&kvm->slots_lock);?

Definitely, nicely spotted! That's a typo on my part.

It doesn't affect the test results because kvmtool will call finalize exactly once
after the entire VM has been initialized, so there will be no concurrent accesses
to this function.
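
For illustration only, here is a userspace analogy of the lock/unlock mix-up,
written against POSIX threads rather than the kernel's struct mutex (which has
different semantics and its own debug checks). The helper names are
hypothetical and nothing below is part of the patch. With an error-checking
pthread mutex, unlocking a mutex that was never locked is reported as an error
instead of passing silently:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: create an error-checking mutex. Returns 0 on success. */
int init_errorcheck_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;

	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(m, &attr);

	pthread_mutexattr_destroy(&attr);
	return ret;
}

/* The typo: unlock on a mutex that the caller never locked. */
int unlock_without_lock(void)
{
	pthread_mutex_t m;
	int ret;

	ret = init_errorcheck_mutex(&m);
	if (ret)
		return ret;

	ret = pthread_mutex_unlock(&m);	/* reported as EPERM */

	pthread_mutex_destroy(&m);
	return ret;
}

/* The intended pattern: lock, do the work, unlock. */
int balanced_lock_unlock(void)
{
	pthread_mutex_t m;
	int ret;

	ret = init_errorcheck_mutex(&m);
	if (ret)
		return ret;

	ret = pthread_mutex_lock(&m);
	if (!ret) {
		/* critical section would go here */
		ret = pthread_mutex_unlock(&m);
	}

	pthread_mutex_destroy(&m);
	return ret;
}

/* Usage: the unbalanced unlock fails, the balanced pair succeeds. */
void run_demo(void)
{
	assert(unlock_without_lock() == EPERM);
	assert(balanced_lock_unlock() == 0);
}
```

A default (non error-checking) pthread mutex gives no such guarantee, which is
the closer analogy to why the typo went unnoticed with a single caller.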

Thanks,

Alex