[PATCH v2] KVM/arm64: reconfigurate the event filters for guest context

Thu Aug 10 08:27:11 PDT 2023

Huang,

Please make sure you add everyone who commented on v1 (I've Cc'd Mark
so that he can shime need as needed).

On Thu, 10 Aug 2023 08:29:06 +0100,
Huang Shijie <shijie at os.amperecomputing.com> wrote:
> 
> 1.) Background.
>    1.1) In arm64, start a guest with Qemu which is running as a VMM of KVM,
>         and bind the guest to core 33 and run program "a" in guest.
>         The code of "a" shows below:
>    	----------------------------------------------------------
> 		#include <stdio.h>
> 
> 		int main()
> 		{
> 			unsigned long i = 0;
> 
> 			for (;;) {
> 				i++;
> 			}
> 
> 			printf("i:%ld\n", i);
> 			return 0;
> 		}
>    	----------------------------------------------------------
> 
>    1.2) Use the following perf command in host:
>       #perf stat -e cycles:G,cycles:H -C 33 -I 1000 sleep 1
>           #           time             counts unit events
>                1.000817400      3,299,471,572      cycles:G
>                1.000817400          3,240,586      cycles:H
> 
>        This result is correct, my cpu's frequency is 3.3G.
> 
>    1.3) Use the following perf command in host:
>       #perf stat -e cycles:G,cycles:H -C 33 -d -d  -I 1000 sleep 1
>             time             counts unit events
>      1.000831480        153,634,097      cycles:G                                                                (70.03%)
>      1.000831480      3,147,940,599      cycles:H                                                                (70.03%)
>      1.000831480      1,143,598,527      L1-dcache-loads                                                         (70.03%)
>      1.000831480              9,986      L1-dcache-load-misses            #    0.00% of all L1-dcache accesses   (70.03%)
>      1.000831480    <not supported>      LLC-loads
>      1.000831480    <not supported>      LLC-load-misses
>      1.000831480        580,887,696      L1-icache-loads                                                         (70.03%)
>      1.000831480             77,855      L1-icache-load-misses            #    0.01% of all L1-icache accesses   (70.03%)
>      1.000831480      6,112,224,612      dTLB-loads                                                              (70.03%)
>      1.000831480             16,222      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  (69.94%)
>      1.000831480        590,015,996      iTLB-loads                                                              (59.95%)
>      1.000831480                505      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  (59.95%)
> 
>        This result is wrong. The "cycle:G" should be nearly 3.3G.
> 
> 2.) Root cause.
> 	There is only 7 counters in my arm64 platform:
> 	  (one cycle counter) + (6 normal counters)
> 
> 	In 1.3 above, we will use 10 event counters.
> 	Since we only have 7 counters, the perf core will trigger
>        	multiplexing in hrtimer:
> 	     perf_mux_hrtimer_restart() --> perf_rotate_context().
> 
>        If the hrtimer occurs when the host is running, it's fine.
>        If the hrtimer occurs when the guest is running,
>        the perf_rotate_context() will program the PMU with filters for
>        host context. The KVM does not have a chance to restore
>        PMU registers with kvm_vcpu_pmu_restore_guest().
>        The PMU does not work correctly, so we got wrong result.
> 
> 3.) About this patch.
> 	Make a KVM_REQ_RELOAD_PMU request before reentering the
> 	guest. The request will call kvm_vcpu_pmu_restore_guest()
> 	to reconfigurate the filters for guest context.
> 
> 4.) Test result of this patch:
>       #perf stat -e cycles:G,cycles:H -C 33 -d -d  -I 1000 sleep 1
>             time             counts unit events
>      1.001006400      3,298,348,656      cycles:G                                                                (70.03%)
>      1.001006400          3,144,532      cycles:H                                                                (70.03%)
>      1.001006400            941,149      L1-dcache-loads                                                         (70.03%)
>      1.001006400             17,937      L1-dcache-load-misses            #    1.91% of all L1-dcache accesses   (70.03%)
>      1.001006400    <not supported>      LLC-loads
>      1.001006400    <not supported>      LLC-load-misses
>      1.001006400          1,101,889      L1-icache-loads                                                         (70.03%)
>      1.001006400            121,638      L1-icache-load-misses            #   11.04% of all L1-icache accesses   (70.03%)
>      1.001006400          1,031,228      dTLB-loads                                                              (70.03%)
>      1.001006400             26,952      dTLB-load-misses                 #    2.61% of all dTLB cache accesses  (69.93%)
>      1.001006400          1,030,678      iTLB-loads                                                              (59.94%)
>      1.001006400                338      iTLB-load-misses                 #    0.03% of all iTLB cache accesses  (59.94%)
> 
>     The result is correct. The "cycle:G" is nearly 3.3G now.
> 
> Signed-off-by: Huang Shijie <shijie at os.amperecomputing.com>
> ---
> v1 --> v2:
> 	Do not change perf/core code, only change the ARM64 kvm code.
> 	v1: https://lkml.org/lkml/2023/8/8/1465
> 
> ---
>  arch/arm64/kvm/arm.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index c2c14059f6a8..475a2f0e0e40 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -919,8 +919,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		if (!ret)
>  			ret = 1;
>  
> -		if (ret > 0)
> +		if (ret > 0) {
> +			/*
> +			 * The perf_rotate_context() may rotate the events and
> +			 * reprogram PMU with filters for host context.
> +			 * So make a request before reentering the guest to
> +			 * reconfigurate the event filters for guest context.
> +			 */
> +			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
> +
>  			ret = check_vcpu_requests(vcpu);
> +		}

This looks extremely heavy handed. You're performing the reload on
*every* entry, and I don't think this is right (exit-heavy workloads
will suffer from it)

Furthermore, you're also reloading the virtual state of the PMU
(recreating guest events and other things), all of which looks pretty
pointless, as all we're interested in is what is being counted on the
*host*.

Instead, we can restrict the reload of the host state (and only that)
to situations where:

- we're running on a VHE system

- we have a host PMUv3 (not everybody does), as that's the only way we
  can profile a guest

and ideally we would have a way to detect that a rotation happened
(which may requires some help from the low-level PMU code).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.