[PATCH v3 00/20] KVM: ARM64: Add guest PMU support
Shannon Zhao
shannon.zhao at linaro.org
Mon Oct 26 18:15:09 PDT 2015
On 2015/10/26 19:33, Christoffer Dall wrote:
> On Thu, Sep 24, 2015 at 03:31:05PM -0700, Shannon Zhao wrote:
>> This patchset adds guest PMU support for KVM on ARM64. It takes a
>> trap-and-emulate approach: when the guest programs an event to monitor,
>> the access is trapped by KVM, which calls the perf_event API to create a
>> perf event and the relevant perf_event APIs to read the event's count.
>>
>> Use perf to test this patchset in the guest. "perf list" shows the
>> hardware events and hardware cache events that perf supports. Then use
>> "perf stat -e EVENT" to monitor an event, for example "perf stat -e
>> cycles" to count CPU cycles and "perf stat -e cache-misses" to count
>> cache misses.
>>
>> Below are the outputs of "perf stat -r 5 sleep 5" when run in the host
>> and in the guest.
>>
>> Host:
>> Performance counter stats for 'sleep 5' (5 runs):
>>
>> 0.551428 task-clock (msec) # 0.000 CPUs utilized ( +- 0.91% )
>> 1 context-switches # 0.002 M/sec
>> 0 cpu-migrations # 0.000 K/sec
>> 48 page-faults # 0.088 M/sec ( +- 1.05% )
>> 1150265 cycles # 2.086 GHz ( +- 0.92% )
>> <not supported> stalled-cycles-frontend
>> <not supported> stalled-cycles-backend
>> 526398 instructions # 0.46 insns per cycle ( +- 0.89% )
>> <not supported> branches
>> 9485 branch-misses # 17.201 M/sec ( +- 2.35% )
>>
>> 5.000831616 seconds time elapsed ( +- 0.00% )
>>
>> Guest:
>> Performance counter stats for 'sleep 5' (5 runs):
>>
>> 0.730868 task-clock (msec) # 0.000 CPUs utilized ( +- 1.13% )
>> 1 context-switches # 0.001 M/sec
>> 0 cpu-migrations # 0.000 K/sec
>> 48 page-faults # 0.065 M/sec ( +- 0.42% )
>> 1642982 cycles # 2.248 GHz ( +- 1.04% )
>> <not supported> stalled-cycles-frontend
>> <not supported> stalled-cycles-backend
>> 637964 instructions # 0.39 insns per cycle ( +- 0.65% )
>> <not supported> branches
>> 10377 branch-misses # 14.198 M/sec ( +- 1.09% )
>>
>> 5.001289068 seconds time elapsed ( +- 0.00% )
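
To make the flow described above concrete: when the guest programs an
event into an event-type register, the trap handler ends up creating a
kernel perf event for that counter, and a later read of the counter
register is answered from perf. Below is only a rough sketch of that
idea, not the code from the series; the helper names and the attr
fields chosen here are illustrative.

#include <linux/err.h>
#include <linux/perf_event.h>
#include <linux/sched.h>

/* Called when the guest writes an event number into an event-type register. */
static struct perf_event *kvm_pmu_create_counter(u64 eventsel)
{
	struct perf_event_attr attr = { };
	struct perf_event *event;

	attr.type       = PERF_TYPE_RAW;
	attr.size       = sizeof(attr);
	attr.pinned     = 1;
	attr.exclude_hv = 1;		/* don't count cycles spent at EL2 */
	attr.config     = eventsel;	/* the event the guest asked for */

	/* Count on the current (vcpu) thread, on whichever CPU it runs. */
	event = perf_event_create_kernel_counter(&attr, -1, current,
						 NULL, NULL);
	return IS_ERR(event) ? NULL : event;
}

/* Called when the guest reads the counter; the value comes from perf. */
static u64 kvm_pmu_read_counter(struct perf_event *event)
{
	u64 enabled, running;

	return event ? perf_event_read_value(event, &enabled, &running) : 0;
}

The series itself additionally handles counter enable/disable and
overflow, and keeps the perf_event pointers in the vcpu's PMU state.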
>
> This looks pretty cool!
>
> I'll review your next patch set version in more detail.
>
> Have you tried running a no-op cycle counter read test in the guest and
> in the host?
>
> Basically something like:
>
> static void nop(void *junk)
> {
> }
>
> static void test_nop(void)
> {
> 	unsigned long before, after;
>
> 	before = read_cycles();
> 	isb();
> 	nop(NULL);
> 	isb();
> 	after = read_cycles();
> }
>
> I would be very curious to see whether we get the ~6000-cycle overhead
> in the guest, compared to bare metal, that I expect.
>
Ok, I'll try this while I'm doing more tests on v4.
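
For reference, here is roughly how I plan to implement read_cycles()
and isb() for that test on ARM64. This is only a sketch and assumes the
cycle counter is already enabled (PMCR_EL0.E and PMCNTENSET_EL0.C set,
and PMUSERENR_EL0 configured if the test runs in userspace):

#include <stdio.h>

static inline unsigned long read_cycles(void)
{
	unsigned long cval;

	/* PMCCNTR_EL0 is the PMU cycle counter. */
	asm volatile("mrs %0, pmccntr_el0" : "=r" (cval));
	return cval;
}

static inline void isb(void)
{
	asm volatile("isb" : : : "memory");
}

static void nop(void *junk)
{
}

static void test_nop(void)
{
	unsigned long before, after;

	before = read_cycles();
	isb();
	nop(NULL);
	isb();
	after = read_cycles();

	/* after - before is the measured read overhead in cycles. */
	printf("nop took %lu cycles\n", after - before);
}

Running test_nop() in the host and in the guest should then show how
many extra cycles the trap-and-emulate path adds to each counter read.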
> If we do, we should consider a hot path in the EL2 assembly code to read
> the cycle counter, reducing the overhead so measurements are more precise.
>
--
Shannon