[PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
Like Xu
like.xu.linux at gmail.com
Thu Aug 26 23:31:43 PDT 2021
On 27/8/2021 3:59 am, Sean Christopherson wrote:
> TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
> foundation that I would like to fix.
Obviously, this patch is not closely related to the guest PEBS feature enabling,
and we can certainly put this issue in another discussion thread [1].
[1] https://lore.kernel.org/kvm/20210827005718.585190-1-seanjc@google.com/
>
> On Fri, Aug 06, 2021, Zhu Lingshan wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 464917096e73..e466fc8176e1 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
>> */
>> struct perf_guest_info_callbacks *perf_guest_cbs;
>>
>> +/* explicitly use __weak to fix duplicate symbol error */
>> +void __weak arch_perf_update_guest_cbs(void)
>> +{
>> +}
>> +
>> int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>> {
>> + if (WARN_ON_ONCE(perf_guest_cbs))
>> + return -EBUSY;
>> +
>> perf_guest_cbs = cbs;
>> + arch_perf_update_guest_cbs();
>
> This is horribly broken, it fails to cleanup the static calls when KVM unregisters
> the callbacks, which happens when the vendor module, e.g. kvm_intel, is unloaded.
> The explosion doesn't happen until 'kvm' is unloaded because the functions are
> implemented in 'kvm', i.e. the use-after-free is deferred a bit.
>
> BUG: unable to handle page fault for address: ffffffffa011bb90
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0
> Oops: 0010 [#1] PREEMPT SMP
> CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:0xffffffffa011bb90
> Code: Unable to access opcode bytes at RIP 0xffffffffa011bb66.
> Call Trace:
> <NMI>
> ? perf_misc_flags+0xe/0x50
> ? perf_prepare_sample+0x53/0x6b0
> ? perf_event_output_forward+0x67/0x160
> ? kvm_clock_read+0x14/0x30
> ? kvm_sched_clock_read+0x5/0x10
> ? sched_clock_cpu+0xd/0xd0
> ? __perf_event_overflow+0x52/0xf0
> ? handle_pmi_common+0x1f2/0x2d0
> ? __flush_tlb_all+0x30/0x30
> ? intel_pmu_handle_irq+0xcf/0x410
> ? nmi_handle+0x5/0x260
> ? perf_event_nmi_handler+0x28/0x50
> ? nmi_handle+0xc7/0x260
> ? lock_release+0x2b0/0x2b0
> ? default_do_nmi+0x6b/0x170
> ? exc_nmi+0x103/0x130
> ? end_repeat_nmi+0x16/0x1f
> ? lock_release+0x2b0/0x2b0
> ? lock_release+0x2b0/0x2b0
> ? lock_release+0x2b0/0x2b0
> </NMI>
> Modules linked in: irqbypass [last unloaded: kvm]
>
> Even more fun, the existing perf_guest_cbs framework is also broken, though it's
> much harder to get it to fail, and probably impossible to get it to fail without
> some help. The issue is that perf_guest_cbs is global, which means that it can
> be nullified by KVM (during module unload) while the callbacks are being accessed
> by a PMI handler on a different CPU.
>
> The bug has escaped notice because all dererfences of perf_guest_cbs follow the
> same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the
> compiler never reload perf_guest_cbs in this sequence. The compiler does reload
> perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard all but
> guarantees the PMI handler will win the race, e.g. to nullify perf_guest_cbs,
> KVM has to completely exit the guest and teardown down all VMs before it can be
> unloaded.
>
> But with a help, e.g. RAED_ONCE(perf_guest_cbs), unloading kvm_intel can trigger
> a NULL pointer derference, e.g. this tweak
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1eb45139fcc6..202e5ad97f82 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
> {
> int misc = 0;
>
> - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
> + if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
> if (perf_guest_cbs->is_user_mode())
> misc |= PERF_RECORD_MISC_GUEST_USER;
> else
>
>
> while spamming module load/unload leads to:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP
> CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:perf_misc_flags+0x1c/0x70
> Call Trace:
> perf_prepare_sample+0x53/0x6b0
> perf_event_output_forward+0x67/0x160
> __perf_event_overflow+0x52/0xf0
> handle_pmi_common+0x207/0x300
> intel_pmu_handle_irq+0xcf/0x410
> perf_event_nmi_handler+0x28/0x50
> nmi_handle+0xc7/0x260
> default_do_nmi+0x6b/0x170
> exc_nmi+0x103/0x130
> asm_exc_nmi+0x76/0xbf
>
>
> The good news is that I have a series that should fix both the existing NULL pointer
> bug and mostly obviate the need for static calls. The bad news is that my approach,
> making perf_guest_cbs per-CPU, likely complicates turning these into static calls,
> though I'm guessing it's still a solvable problem.
>
> Tangentially related, IMO we should make architectures opt-in to getting
> perf_guest_cbs and nuke all of the code in the below files. Except for arm,
> which recently lost KVM support, it's all a bunch of useless copy-paste code that
> serves no purpose and just complicates cleanups like this.
>
>> arch/arm/kernel/perf_callchain.c | 16 +++++++-----
>> arch/csky/kernel/perf_callchain.c | 4 +--
>> arch/nds32/kernel/perf_event_cpu.c | 16 +++++++-----
>> arch/riscv/kernel/perf_callchain.c | 4 +--
More information about the linux-riscv
mailing list