[PATCH v2] KVM: arm64: Allow to limit number of PMU counters

Alexander Graf graf at amazon.com
Thu Sep 10 12:46:32 EDT 2020



On 10.09.20 17:52, Robin Murphy wrote:
> 
> On 2020-09-10 11:18, Alexander Graf wrote:
>>
>>
>> On 10.09.20 12:06, Marc Zyngier wrote:
>>>
>>> On 2020-09-08 21:57, Alexander Graf wrote:
>>>> We currently pass through the number of PMU counters that we have
>>>> available in hardware to guests. So if my host supports 10 concurrently
>>>> active PMU counters, my guest will be able to spawn 10 counters as well.
>>>>
>>>> This is undesirable if we also want to use the PMU on the host for
>>>> monitoring. In that case, we want to split the PMU between guest and
>>>> host.
>>>>
>>>> To help that case, let's add a PMU attr that allows us to limit the
>>>> number of PMU counters that we expose. With this patch in place, user
>>>> space can keep some counters free for host use.
>>>>
>>>> Signed-off-by: Alexander Graf <graf at amazon.com>
>>>>
>>>> ---
>>>>
>>>> Because this patch touches the same code paths as the vPMU filtering
>>>> one, and the vPMU filtering generalized a few conditions in the attr
>>>> path, I've based it on top. Please let me know if you want it
>>>> independent instead.
>>>>
>>>> v1 -> v2:
>>>>
>>>>   - Add documentation
>>>>   - Add read support
>>>> ---
>>>>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>>>>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>>>>  arch/arm64/kvm/pmu-emul.c               | 32 ++++++++++++++++++++++++++++++++
>>>>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>>>>  include/kvm/arm_pmu.h                   |  1 +
>>>>  5 files changed, 67 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>>>> index 203b91e93151..1a1c8d8c8b1d 100644
>>>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>>>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>>>> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the cycle counter is possible
>>>>  using event 0x11 (CPU_CYCLES).
>>>>
>>>>
>>>> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
>>>> +---------------------------------------------
>>>> +
>>>> +:Parameters: in kvm_device_attr.addr the address for the limit of concurrent
>>>> +             events is a pointer to an int
>>>> +
>>>> +:Returns:
>>>> +
>>>> +      =======  ======================================================
>>>> +      -ENODEV: PMUv3 not supported
>>>> +      -EBUSY:  PMUv3 already initialized
>>>> +      -EINVAL: Number of events exceeds the hardware limit
>>>> +      =======  ======================================================
>>>> +
>>>> +Reconfigure the limit of concurrent PMU events that the guest can monitor.
>>>> +This number is directly exposed as part of the PMCR_EL0 register.
>>>> +
>>>> +On vcpu creation, this attribute is set to the hardware limit of the current
>>>> +platform. If you need to determine the hardware limit, you can read this
>>>> +attribute before setting it.
>>>> +
>>>> +Restrictions: The default value for this property is the number of hardware
>>>> +supported events. Only values that are smaller than the hardware limit can
>>>> +be set.
>>>> +
>>>>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>>>  =================================
>>>>
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 7b1511d6ce44..db025c0b5a40 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>>>>
>>>>  /* Device Control API on vcpu fd */
>>>>  #define KVM_ARM_VCPU_PMU_V3_CTRL     0
>>>> -#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
>>>> -#define   KVM_ARM_VCPU_PMU_V3_INIT   1
>>>> -#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
>>>> +#define   KVM_ARM_VCPU_PMU_V3_IRQ            0
>>>> +#define   KVM_ARM_VCPU_PMU_V3_INIT           1
>>>> +#define   KVM_ARM_VCPU_PMU_V3_FILTER         2
>>>> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS     3
>>>>  #define KVM_ARM_VCPU_TIMER_CTRL              1
>>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER              0
>>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER              1
>>>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>>>> index 0458860bade2..c7915b95fec0 100644
>>>> --- a/arch/arm64/kvm/pmu-emul.c
>>>> +++ b/arch/arm64/kvm/pmu-emul.c
>>>> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>>>>
>>>>       for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>>>>               pmu->pmc[i].idx = i;
>>>> +
>>>> +     pmu->num_events = perf_num_counters() - 1;
>>>>  }
>>>>
>>>>  /**
>>>> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>>>
>>>>               return 0;
>>>>       }
>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>>> +             u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;
>>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>>> +             u32 num_events;
>>>> +
>>>> +             if (get_user(num_events, uaddr))
>>>> +                     return -EFAULT;
>>>> +
>>>> +             if (num_events >= perf_num_counters())
>>>> +                     return -EINVAL;
>>>> +
>>>> +             vcpu->arch.pmu.num_events = num_events;
>>>> +
>>>> +             num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
>>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
>>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
>>>> +
>>>> +             return 0;
>>>> +     }
>>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>>               return kvm_arm_pmu_v3_init(vcpu);
>>>>       }
>>>> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>>>               irq = vcpu->arch.pmu.irq_num;
>>>>               return put_user(irq, uaddr);
>>>>       }
>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>>> +             u32 num_events;
>>>> +
>>>> +             if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>> +                     return -ENODEV;
>>>> +
>>>> +             num_events = vcpu->arch.pmu.num_events;
>>>> +             return put_user(num_events, uaddr);
>>>> +     }
>>>>       }
>>>>
>>>>       return -ENXIO;
>>>> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>>>       case KVM_ARM_VCPU_PMU_V3_IRQ:
>>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>>       case KVM_ARM_VCPU_PMU_V3_FILTER:
>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>>>>               if (kvm_arm_support_pmu_v3() &&
>>>>                   test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>>                       return 0;
>>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>>> index 20ab2a7d37ca..d51e39600bbd 100644
>>>> --- a/arch/arm64/kvm/sys_regs.c
>>>> +++ b/arch/arm64/kvm/sys_regs.c
>>>> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>>>>              | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
>>>>       if (!system_supports_32bit_el0())
>>>>               val |= ARMV8_PMU_PMCR_LC;
>>>> +
>>>> +     /* Override number of event selectors */
>>>> +     val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
>>>> +     val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
>>>> +
>>>>       __vcpu_sys_reg(vcpu, r->reg) = val;
>>>>  }
>>>>
>>>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>>>> index 98cbfe885a53..ea3fc96a37d9 100644
>>>> --- a/include/kvm/arm_pmu.h
>>>> +++ b/include/kvm/arm_pmu.h
>>>> @@ -27,6 +27,7 @@ struct kvm_pmu {
>>>>       bool ready;
>>>>       bool created;
>>>>       bool irq_level;
>>>> +     u8 num_events;
>>>>  };
>>>>
>>>>  #define kvm_arm_pmu_v3_ready(v)              ((v)->arch.pmu.ready)
>>>
>>> I see several problems with this approach:
>>>
>>> - userspace doesn't really have a good way to retrieve the number of
>>>    counters.
>> It does with v2, because it can then just read the register ;). I agree
>> that it's clunky though.
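>>
>> For completeness, reading and then lowering the limit from user space
>> would look roughly like this (untested sketch; assumes a vcpu fd in
>> "vcpu_fd", the usual <linux/kvm.h>/<sys/ioctl.h> includes, error
>> handling omitted, and it has to happen before KVM_ARM_VCPU_PMU_V3_INIT
>> or you get -EBUSY):
>>
>>     int num_events;
>>     struct kvm_device_attr attr = {
>>         .group = KVM_ARM_VCPU_PMU_V3_CTRL,
>>         .attr  = KVM_ARM_VCPU_PMU_V3_NUM_EVENTS,
>>         .addr  = (__u64)(unsigned long)&num_events,
>>     };
>>
>>     /* On read, the attribute still holds the hardware default */
>>     ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
>>
>>     /* Keep 4 counters for the host, hand the rest to the guest */
>>     num_events -= 4;
>>     ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);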
>>
>>>
>>> - Limiting the number of counters for the guest doesn't mean anything
>>>    when it comes to the actual use of the HW counters, given that we
>>>    don't allocate them ourselves (it's all perf doing the actual work).
>>
>> We do cap the number of actively requestable counters via perf by the
>> PMCR.N limit. So in a way, it does mean something.
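>>
>> The guest-visible side of that is just PMCR_EL0.N, so a guest can
>> discover its limit with something like this (minimal sketch, reusing
>> the macros from the patch):
>>
>>     u64 pmcr;
>>     u32 nr_counters;
>>
>>     asm volatile("mrs %0, pmcr_el0" : "=r" (pmcr));
>>     nr_counters = (pmcr >> ARMV8_PMU_PMCR_N_SHIFT) & ARMV8_PMU_PMCR_N_MASK;
>>
>> and a well-behaved guest PMU driver won't request more event counters
>> than that.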
>>
>>> - If you want to "pin" counters for the host, why don't you just do
>>>    that before starting the guest?
>>
>> You can do that. Imagine I have 10 counters. I pin 4 of them to the
>> host. I still tell my guest that it can use 10. That means perf will
>> then time slice and juggle 10 guest event counters on those remaining 6
>> hardware counters. That juggling heavily reduces accuracy.
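>>
>> (By pinning I mean roughly the following on the host, as a sketch; a
>> pinned group leader keeps its hardware counter instead of getting
>> multiplexed:
>>
>>     struct perf_event_attr pe = {
>>         .type   = PERF_TYPE_HARDWARE,
>>         .size   = sizeof(pe),
>>         .config = PERF_COUNT_HW_INSTRUCTIONS,
>>         .pinned = 1,    /* never rotate this event off its counter */
>>     };
>>     int fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
>>
>> repeated for each of the 4 host events.)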
>>
>>> I think you need to look at the bigger picture: how to limit physical
>>> counter usage for a given userspace task. This needs to happen in perf
>>> itself, and not in KVM.
>>
>> That's definitely another way to look at it that I agree with.
>>
>> What we really want is to expose the number of counters the guest has
>> available, not the maximum number of counters the hardware supports.
>>
>> So in theory it would be enough to ask perf how many counters it has
>> free for me to consume without overcommitting. But that number could
>> change between multiple invocations of KVM and thus break things like
>> live migration, no?
>>
>> Maybe what we really want is an interface from user space to perf to say
>> "how many counters can you dedicate to me?" and "reserve them for me".
>> User space could then reserve them as dedicated counters, and KVM would
>> just need to either probe for the reservation or get told by user space
>> what to expose via ONE_REG, as Drew suggested. It'd then be up to user
>> space to ensure that the reservation matches the number of exposed
>> counters.
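>>
>> In pseudo-code, and purely hypothetically (neither
>> perf_counters_available() nor perf_reserve_counters() exists today,
>> both names are made up):
>>
>>     int n = perf_counters_available();  /* "how many can you dedicate?" */
>>     perf_reserve_counters(n);           /* "reserve them for me" */
>>     /* ... then tell KVM to expose exactly n counters to the guest */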
> 
> Note that if the aim is to avoid the guest seeing unexpectedly weird
> behaviour, then it's not just the *number* of counters that matters, but
> the underlying physical allocation too, thanks to the possibility of
> chained events.

Ideally, wouldn't guest chaining propagate into host chaining as well?
I'd have to double-check whether it does, but my naive thinking is that
if I reserve 4 hardware counters for the guest, and the guest ends up
using 4 hardware counters regardless of their chaining attributes, I
should still be able to fit them all.


Alex


