[PATCH v2] KVM: arm64: Allow to limit number of PMU counters

Thu Sep 10 11:52:46 EDT 2020

On 2020-09-10 11:18, Alexander Graf wrote:
> 
> 
> On 10.09.20 12:06, Marc Zyngier wrote:
>>
>> On 2020-09-08 21:57, Alexander Graf wrote:
>>> We currently pass through the number of PMU counters that we have
>>> available
>>> in hardware to guests. So if my host supports 10 concurrently active
>>> PMU
>>> counters, my guest will be able to spawn 10 counters as well.
>>>
>>> This is undesireable if we also want to use the PMU on the host for
>>> monitoring. In that case, we want to split the PMU between guest and
>>> host.
>>>
>>> To help that case, let's add a PMU attr that allows us to limit the
>>> number
>>> of PMU counters that we expose. With this patch in place, user space
>>> can
>>> keep some counters free for host use.
>>>
>>> Signed-off-by: Alexander Graf <graf at amazon.com>
>>>
>>> ---
>>>
>>> Because this patch touches the same code paths as the vPMU filtering
>>> one
>>> and the vPMU filtering generalized a few conditions in the attr path,
>>> I've based it on top. Please let me know if you want it independent
>>> instead.
>>>
>>> v1 -> v2:
>>>
>>>   - Add documentation
>>>   - Add read support
>>> ---
>>>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>>>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>>>  arch/arm64/kvm/pmu-emul.c               | 32
>>> ++++++++++++++++++++++++++++++++
>>>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>>>  include/kvm/arm_pmu.h                   |  1 +
>>>  5 files changed, 67 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst
>>> b/Documentation/virt/kvm/devices/vcpu.rst
>>> index 203b91e93151..1a1c8d8c8b1d 100644
>>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>>> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
>>> cycle counter is possible
>>>  using event 0x11 (CPU_CYCLES).
>>>
>>>
>>> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
>>> +---------------------------------------------
>>> +
>>> +:Parameters: in kvm_device_attr.addr the address for the limit of
>>> concurrent
>>> +             events is a pointer to an int
>>> +
>>> +:Returns:
>>> +
>>> +      =======  ======================================================
>>> +      -ENODEV: PMUv3 not supported
>>> +      -EBUSY:  PMUv3 already initialized
>>> +      -EINVAL: Too large number of events
>>> +      =======  ======================================================
>>> +
>>> +Reconfigure the limit of concurrent PMU events that the guest can
>>> monitor.
>>> +This number is directly exposed as part of the PMCR_EL0 register.
>>> +
>>> +On vcpu creation, this attribute is set to the hardware limit of the
>>> current
>>> +platform. If you need to determine the hardware limit, you can read
>>> this
>>> +attribute before setting it.
>>> +
>>> +Restrictions: The default value for this property is the number of
>>> hardware
>>> +supported events. Only values that are smaller than the hardware limit
>>> can
>>> +be set.
>>> +
>>>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>>  =================================
>>>
>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>>> b/arch/arm64/include/uapi/asm/kvm.h
>>> index 7b1511d6ce44..db025c0b5a40 100644
>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>>>
>>>  /* Device Control API on vcpu fd */
>>>  #define KVM_ARM_VCPU_PMU_V3_CTRL     0
>>> -#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
>>> -#define   KVM_ARM_VCPU_PMU_V3_INIT   1
>>> -#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
>>> +#define   KVM_ARM_VCPU_PMU_V3_IRQ            0
>>> +#define   KVM_ARM_VCPU_PMU_V3_INIT           1
>>> +#define   KVM_ARM_VCPU_PMU_V3_FILTER         2
>>> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS     3
>>>  #define KVM_ARM_VCPU_TIMER_CTRL              1
>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER              0
>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER              1
>>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>>> index 0458860bade2..c7915b95fec0 100644
>>> --- a/arch/arm64/kvm/pmu-emul.c
>>> +++ b/arch/arm64/kvm/pmu-emul.c
>>> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>>>
>>>       for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>>>               pmu->pmc[i].idx = i;
>>> +
>>> +     pmu->num_events = perf_num_counters() - 1;
>>>  }
>>>
>>>  /**
>>> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
>>> *vcpu, struct kvm_device_attr *attr)
>>>
>>>               return 0;
>>>       }
>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>> +             u64 mask = ARMV8_PMU_PMCR_N_MASK << 
>>> ARMV8_PMU_PMCR_N_SHIFT;
>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>> +             u32 num_events;
>>> +
>>> +             if (get_user(num_events, uaddr))
>>> +                     return -EFAULT;
>>> +
>>> +             if (num_events >= perf_num_counters())
>>> +                     return -EINVAL;
>>> +
>>> +             vcpu->arch.pmu.num_events = num_events;
>>> +
>>> +             num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
>>> +
>>> +             return 0;
>>> +     }
>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>               return kvm_arm_pmu_v3_init(vcpu);
>>>       }
>>> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu
>>> *vcpu, struct kvm_device_attr *attr)
>>>               irq = vcpu->arch.pmu.irq_num;
>>>               return put_user(irq, uaddr);
>>>       }
>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>> +             u32 num_events;
>>> +
>>> +             if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>> +                     return -ENODEV;
>>> +
>>> +             num_events = vcpu->arch.pmu.num_events;
>>> +             return put_user(num_events, uaddr);
>>> +     }
>>>       }
>>>
>>>       return -ENXIO;
>>> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu
>>> *vcpu, struct kvm_device_attr *attr)
>>>       case KVM_ARM_VCPU_PMU_V3_IRQ:
>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>       case KVM_ARM_VCPU_PMU_V3_FILTER:
>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>>>               if (kvm_arm_support_pmu_v3() &&
>>>                   test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>                       return 0;
>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>> index 20ab2a7d37ca..d51e39600bbd 100644
>>> --- a/arch/arm64/kvm/sys_regs.c
>>> +++ b/arch/arm64/kvm/sys_regs.c
>>> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu,
>>> const struct sys_reg_desc *r)
>>>              | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & 
>>> (~ARMV8_PMU_PMCR_E);
>>>       if (!system_supports_32bit_el0())
>>>               val |= ARMV8_PMU_PMCR_LC;
>>> +
>>> +     /* Override number of event selectors */
>>> +     val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
>>> +     val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
>>> +
>>>       __vcpu_sys_reg(vcpu, r->reg) = val;
>>>  }
>>>
>>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>>> index 98cbfe885a53..ea3fc96a37d9 100644
>>> --- a/include/kvm/arm_pmu.h
>>> +++ b/include/kvm/arm_pmu.h
>>> @@ -27,6 +27,7 @@ struct kvm_pmu {
>>>       bool ready;
>>>       bool created;
>>>       bool irq_level;
>>> +     u8 num_events;
>>>  };
>>>
>>>  #define kvm_arm_pmu_v3_ready(v)              ((v)->arch.pmu.ready)
>>
>> I see several problems with this approach:
>>
>> - userspace doesn't really have a good way to retrieve the number of
>>    counters.
> It does with v2, because it can then just read the register ;). I agree 
> that it's clunky though.
> 
>>
>> - Limiting the number of counters for the guest doesn't mean anything
>>    when it comes to the actual use of the HW counters, given that we
>>    don't allocate them ourselves (it's all perf doing the actual work).
> 
> We do cap the number of actively requestable counters via perf by the 
> PMCR.N limit. So in a way, it does mean something.
> 
>> - If you want to "pin" counters for the host, why don't you just do
>>    that before starting the guest?
> 
> You can do that. Imagine I have 10 counters. I pin 4 of them to the 
> host. I still tell my guest that it can use 6. That means perf will then 
> time slice and juggle 10 guest event counters on those remaining 6 
> hardware counters. That juggling heavily reduces accuracy.
> 
>> I think you need to look at the bigger picture: how to limit the use
>> of physical counter usage for a given userspace task. This needs
>> to happen in perf itself, and not in KVM.
> 
> That's definitely another way to look at it that I agree with.
> 
> What we really want is to expose the number of counters the guest has 
> available, not the number of counters hardware can support at maximum.
> 
> So in theory it would be enough to ask perf how many counters it does 
> have free for me to consume without overcommitting. But that would 
> potentially change between multiple invocations of KVM and thus break 
> things like live migration, no?
> 
> Maybe what we really want is an interface to perf from user space to say 
> "how many counters can you dedicate to me?" and "reserve them for me". 
> Then user space could reserve them as dedicated counters and KVM would 
> just need to either probe for the reservation or get told by user space 
> what to expose via ONE_REG as Drew suggested. It'd be up to user space 
> to ensure that the reservation matches the number of exposed counters then.

Note that if the aim is to avoid the guest seeing unexpectedly weird 
behaviour, then it's not just the *number* of counters that matters, but 
the underlying physical allocation too, thanks to the possibility of 
chained events.

Robin.