[PATCH 00/15] perf: KVM: Fix, optimize, and clean up callbacks

Like Xu like.xu.linux at gmail.com
Thu Aug 26 23:52:25 PDT 2021


+ STATIC BRANCH/CALL friends.

On 27/8/2021 8:57 am, Sean Christopherson wrote:
> This started out as a small series[1] to fix a KVM bug related to Intel PT
> interrupt handling and snowballed horribly.
> 
> The main problem being addressed is that the perf_guest_cbs are shared by
> all CPUs, can be nullified by KVM during module unload, and are not
> protected against concurrent access from NMI context.

Shouldn't this be a generic issue with the static_call() usage?

At the beginning, we set up the static call entry assuming perf_guest_cbs != NULL:

	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
		static_call_update(x86_guest_handle_intel_pt_intr,
				   perf_guest_cbs->handle_intel_pt_intr);
	}

and then we clear perf_guest_cbs but still make the static call like this:

DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr,
		    *(perf_guest_cbs->handle_intel_pt_intr));

static int handle_pmi_common(struct pt_regs *regs, u64 status)
{
	...
	if (!static_call(x86_guest_handle_intel_pt_intr)())
		intel_pt_interrupt();
	...
}

Can we switch the static_call() target back to the original "(void *)&__static_call_return0"
in this case?
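
Something along these lines on the unregister path, perhaps? Just a sketch to
illustrate the question; the call-site key is the one from the proposed
static_call patch, and the helper name here is made up:

	/*
	 * Hypothetical teardown helper: repoint the call site at the
	 * return-0 stub *before* perf_guest_cbs is cleared, so an NMI
	 * racing with module unload hits a harmless stub instead of a
	 * stale callback pointer.
	 */
	void x86_perf_reset_guest_pt_intr_handler(void)
	{
		static_call_update(x86_guest_handle_intel_pt_intr,
				   (void *)&__static_call_return0);
	}

Of course that still leaves the question of whether patching the call site is
ordered correctly against a concurrent NMI, which I think is the underlying
race you refer to below.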

> 
> The bug has escaped notice because all dereferences of perf_guest_cbs
> follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern,
> and AFAICT the compiler never reloads perf_guest_cbs in this sequence.
> The compiler does reload perf_guest_cbs for any future dereferences, but
> the ->is_in_guest() guard all but guarantees the PMI handler will win the
> race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest
> and tear down all VMs before it can be unloaded.
> 
> But with help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can
> trigger a NULL pointer dereference (see below).  Manual intervention aside,
> the bug is a bit of a time bomb, e.g. my patch 3 from the original PT
> handling series would have omitted the ->is_in_guest() guard.
> 
> This series fixes the problem by making the callbacks per-CPU, and
> registering/unregistering the callbacks only with preemption disabled
> (except for the Xen case, which doesn't unregister).
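
If I understand the approach correctly, registration ends up looking roughly
like this per-CPU scheme (just my guess at the shape of it; the names below
are made up, not taken from the series):

	/* Sketch only: per-CPU callback pointer, registered/unregistered
	 * with preemption disabled so the NMI handler on this CPU either
	 * sees valid callbacks or sees NULL, never a half-torn-down KVM. */
	static DEFINE_PER_CPU(struct perf_guest_info_callbacks *, guest_cbs);

	void perf_register_guest_callbacks_cpu(struct perf_guest_info_callbacks *cbs)
	{
		lockdep_assert_preemption_disabled();
		__this_cpu_write(guest_cbs, cbs);
	}

	void perf_unregister_guest_callbacks_cpu(void)
	{
		lockdep_assert_preemption_disabled();
		__this_cpu_write(guest_cbs, NULL);
	}

	/* NMI side, e.g. in perf_misc_flags(): snapshot once, then use. */
	struct perf_guest_info_callbacks *cbs = this_cpu_read(guest_cbs);

	if (cbs && cbs->is_in_guest())
		...
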
> 
> This approach also allows for several nice cleanups in this series.
> KVM x86 and arm64 can share callbacks, KVM x86 drops its somewhat
> redundant current_vcpu, and the retpoline that is currently hit when KVM
> is loaded (due to always checking ->is_in_guest()) goes away (it's still
> there when running as Xen Dom0).
> 
> Changing to per-CPU callbacks also provides a good excuse to excise
> copy+paste code from architectures that can't possibly have guest
> callbacks.
> 
> This series conflicts horribly with a proposed patch[2] to use static
> calls for perf_guest_cbs.  But that patch is broken as it completely
> fails to handle unregister, and it's not clear to me whether or not
> it can correctly handle unregister without fixing the underlying race
> (I don't know enough about the code patching for static calls).
> 
> This tweak
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1eb45139fcc6..202e5ad97f82 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
>   {
>          int misc = 0;
> 
> -       if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
> +       if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
>                  if (perf_guest_cbs->is_user_mode())
>                          misc |= PERF_RECORD_MISC_GUEST_USER;
>                  else
> 
> while spamming module load/unload leads to:
> 
>    BUG: kernel NULL pointer dereference, address: 0000000000000000
>    #PF: supervisor read access in kernel mode
>    #PF: error_code(0x0000) - not-present page
>    PGD 0 P4D 0
>    Oops: 0000 [#1] PREEMPT SMP
>    CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>    RIP: 0010:perf_misc_flags+0x1c/0x70
>    Call Trace:
>     perf_prepare_sample+0x53/0x6b0
>     perf_event_output_forward+0x67/0x160
>     __perf_event_overflow+0x52/0xf0
>     handle_pmi_common+0x207/0x300
>     intel_pmu_handle_irq+0xcf/0x410
>     perf_event_nmi_handler+0x28/0x50
>     nmi_handle+0xc7/0x260
>     default_do_nmi+0x6b/0x170
>     exc_nmi+0x103/0x130
>     asm_exc_nmi+0x76/0xbf
> 
> [1] https://lkml.kernel.org/r/20210823193709.55886-1-seanjc@google.com
> [2] https://lkml.kernel.org/r/20210806133802.3528-2-lingshan.zhu@intel.com
> 
> Sean Christopherson (15):
>    KVM: x86: Register perf callbacks after calling vendor's
>      hardware_setup()
>    KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
>      guest
>    perf: Stop pretending that perf can handle multiple guest callbacks
>    perf: Force architectures to opt-in to guest callbacks
>    perf: Track guest callbacks on a per-CPU basis
>    KVM: x86: Register perf callbacks only when actively handling
>      interrupt
>    KVM: Use dedicated flag to track if KVM is handling an NMI from guest
>    KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu
>    KVM: arm64: Register/unregister perf callbacks at vcpu load/put
>    KVM: Move x86's perf guest info callbacks to generic KVM
>    KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
>    KVM: arm64: Convert to the generic perf callbacks
>    KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c
>    perf: Disallow bulk unregistering of guest callbacks and do cleanup
>    perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback
> 
>   arch/arm/kernel/perf_callchain.c   | 28 ++------------
>   arch/arm64/Kconfig                 |  1 +
>   arch/arm64/include/asm/kvm_host.h  |  8 +++-
>   arch/arm64/kernel/perf_callchain.c | 18 ++++++---
>   arch/arm64/kvm/Makefile            |  2 +-
>   arch/arm64/kvm/arm.c               | 13 ++++++-
>   arch/arm64/kvm/perf.c              | 62 ------------------------------
>   arch/arm64/kvm/pmu.c               |  8 ++++
>   arch/csky/kernel/perf_callchain.c  | 10 -----
>   arch/nds32/kernel/perf_event_cpu.c | 29 ++------------
>   arch/riscv/kernel/perf_callchain.c | 10 -----
>   arch/x86/Kconfig                   |  1 +
>   arch/x86/events/core.c             | 17 +++++---
>   arch/x86/events/intel/core.c       |  7 ++--
>   arch/x86/include/asm/kvm_host.h    |  4 +-
>   arch/x86/kvm/pmu.c                 |  2 +-
>   arch/x86/kvm/pmu.h                 |  1 +
>   arch/x86/kvm/svm/svm.c             |  2 +-
>   arch/x86/kvm/vmx/vmx.c             | 25 ++++++++++--
>   arch/x86/kvm/x86.c                 | 54 +++-----------------------
>   arch/x86/kvm/x86.h                 | 12 +++---
>   arch/x86/xen/pmu.c                 |  2 +-
>   include/kvm/arm_pmu.h              |  1 +
>   include/linux/kvm_host.h           | 12 ++++++
>   include/linux/perf_event.h         | 33 ++++++++++++----
>   init/Kconfig                       |  3 ++
>   kernel/events/core.c               | 28 ++++++++------
>   virt/kvm/kvm_main.c                | 42 ++++++++++++++++++++
>   28 files changed, 204 insertions(+), 231 deletions(-)
>   delete mode 100644 arch/arm64/kvm/perf.c
> 


